Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/08/18 00:49:59 UTC

[GitHub] [tvm-rfcs] YuchenJin opened a new pull request, #89: [RFC] Relax Upstreaming

YuchenJin opened a new pull request, #89:
URL: https://github.com/apache/tvm-rfcs/pull/89

   This RFC proposes to upstream the core foundation of Relax, including its IR, compilation flow, and runtime, to address the critical needs identified by the TVM community and to enable a cohesive (but optional) [TVM Unity Connection](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) milestone.
   
   [Rendered version](https://github.com/YuchenJin/tvm-rfcs/blob/relax-upstream-rfc/rfcs/0089-relax-upstreaming.md)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949758752


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   Performance here means:
   
   - **Compilation performance:** defined as the effectiveness of Relax’s optimization passes in transforming and optimizing Relax code.
   - **Runtime performance:** defined as the runtime latency for various Relax workloads, whether they are subgraphs or end-to-end models.
   
   It can be found in the Relax roadmap rfc: https://github.com/apache/tvm-rfcs/pull/69. Maybe I should attach a link to it in this RFC.





[GitHub] [tvm-rfcs] masahi commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
masahi commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1223774454

   @YuchenJin 
   
   > Relax can be viewed as complementary to Relay. Relay focuses on high-level op transformations, while the current Relax passes focus on TIR-graph co-transformations that can enable flexible fusion and layout rewrite, which is hard to achieve in Relay.
   
   I like this separation of work between Relay / Relax. We have many Relay passes that work well and that it doesn't make much sense to reimplement in Relax. 
   
   But if Relax is supposed to be complementary to Relay, why do we keep calling it Relax, as "Relay Next"? "Relay Next" strongly suggests that Relax is something that is going to replace Relay, like we did for nnvm. I'm still not entirely clear whether the plan is to eventually duplicate Relay, or whether Relay and Relax are going to coexist for the foreseeable future.




[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952996763


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   Hi @Mousius, the expressibility here does not mean the support for TVMScript. It means the ability of the IR to express workloads. For example, we need to be able to express symbolic shapes in the IR to support and optimize dynamic shape workloads; we need to be able to express side effects to support training workloads that make inplace updates to the model’s weights during backpropagation.
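
   A minimal sketch of that kind of expressibility, following this RFC's proposed TVMScript syntax (illustrative only): the symbolic dimension `n` is shared between the input and output annotations, which a purely static-shape IR cannot state.
   
   ```python
   @R.function
   def dyn_flatten(x: R.Tensor[(n, 4), "float32"]):  # n is a symbolic dimension
       with R.dataflow():
           # the output shape (n * 4,) stays symbolically related to the input shape
           gv0: R.Tensor[(n * 4,), "float32"] = R.flatten(x)
           R.outputs(gv0)
       return gv0
   ```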





[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952934825


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users will be able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: Unified abstractions and optimizations across layers
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
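+A minimal NumPy sketch of this calling convention (illustrative only, not Relax/TVM API): the caller allocates the destination buffer and the DPS function writes into it instead of returning a new tensor.
+
+```python
+import numpy as np
+
+def dps_add(a, b, out):
+    # DPS: the output buffer is an argument; the result is written in place.
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # the caller (e.g. the compiler) allocates the destination
+dps_add(a, b, out)
+```
+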
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.

Review Comment:
   `call_tir` allows a `PrimFunc` to be called from a Relax program. It allocates the destination tensor first and invokes the `PrimFunc` in destination-passing style. It's implemented as an operator in Relax, so it's a specific kind of call. Normal `Call` nodes do not perform that allocation behavior and do not assume a destination-passing style convention (`call_tir` is really just syntactic sugar for performing the allocation and making the call--as long as the `PrimFunc` has a global symbol, it can be called from Relax using the `ExternFunc` construct in the AST).
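
   A rough sketch of that desugaring, reusing names from the RFC's example above (the `alloc_tensor` helper mirrors the RFC's own pseudocode and is not a literal API):
   
   ```python
   def call_tir_desugared(tir_primfunc, inputs, output_shape, output_dtype):
       """Pseudocode: what the call_tir sugar expands to."""
       out_tensor = alloc_tensor(output_shape, output_dtype)  # destination allocated first
       tir_primfunc(*inputs, out_tensor)                      # PrimFunc invoked in destination-passing style
       return out_tensor
   ```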





[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949611408


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users will be able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: Unified abstractions and optimizations across layers
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (rather than a standalone IR node) to minimize the impact on the IR. From the AST's point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) while still letting the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[k, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation, which selects the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.

Review Comment:
   Representing shapes is part of the language, so it is not independent of IR. Relay handles shapes in its type relation system and it would be very difficult to support symbolic shapes in Relay's system, since these are supposed to be checked dynamically. (Right now, Relay has an `Any` shape for dimensions that should be checked dynamically, but that's very all-or-nothing; you can't specify finer-grained shape constraints that way.)
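
   A small illustration of the gap (the `relay` calls below are the existing TVM API; the Relax annotations follow this RFC's proposed syntax):
   
   ```python
   from tvm import relay
   
   # Relay: a dynamic dimension can only be marked as Any; the relation between
   # the input dimension and the flattened output dimension is lost.
   x_ty = relay.TensorType((relay.Any(), 4), "float32")
   y_ty = relay.TensorType((relay.Any(),), "float32")
   
   # Relax (per this RFC): the same dimension is a symbolic variable n, so the
   # output can be annotated as (n * 4,) and related back to the input at compile time.
   #   x: R.Tensor[(n, 4), "float32"]
   #   y: R.Tensor[(n * 4,), "float32"] = R.flatten(x)
   ```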





[GitHub] [tvm-rfcs] slyubomirsky commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1224450123

   On the point about potentially incorporating symbolic shapes into Relay, I would like to hear more detail about how it can be done with Relay's system of accumulating type constraints and solving them simultaneously. If we were to handle dynamic shapes in Relay, we would need to define semantics for how shape variables are scoped and how assignments are handled, how they can be processed during the solving of type constraints, and what happens if symbolic shape expressions cannot be concluded to be the same at compile time. If this can be neatly incorporated into Relay, then it might make sense to pursue. I would be happy to brainstorm on that issue.




[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950884475


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users will be able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: Unified abstractions and optimizations across layers
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (rather than a standalone IR node) to minimize the impact on the IR. From the AST's point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) while still letting the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[k, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation, which selects the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`):
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It's `nullptr` when the expression's `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the three possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
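+A compact way to see the split described above (a sketch of the two fields, not executable API calls):
+
+```python
+# For lv0 above:
+#   lv0.checked_type_  ->  DynTensorType(ndim=2, dtype="float32")   # type: only rank and dtype
+#   lv0.shape_         ->  ShapeExpr([n, 4])                        # value-level shape information
+```
+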
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where the output shape is data-dependent on the input (e.g. the `unique` operator).
+- C1: The rank of a tensor is not known (this can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation for which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land back into the shape-aware land using match_shape. Skipping shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
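+
+For example (reusing constructs from the `shape_example` program above), a data-dependent result can be brought back into the shape-aware land by re-matching its runtime shape:
+
+```python
+lv5: R.Tensor[_, "float32"] = R.unique(lv4)                # shape unknown at compile time
+lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))  # m is bound here; later code can rely on it
+```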
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight updates during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output with `R.output`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) that allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on the Relax runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, while a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each of its binding blocks is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following relax program as an example, `relax_func` contains a `SeqExpr`, the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call** instruction (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** instruction to get the shape of a tensor).
+- Shape calculations are done via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.

Review Comment:
   Yes, we can get the concrete values of `m` and `n` by invoking `shape_of(A)` at run time, and store them in the shape heap of the vm.
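
To make the heap manipulation concrete, here is a minimal NumPy sketch of the sequence described in Section 4.2; the `store_shape` helper below only mirrors the prose and is not the actual `vm.builtin` API:

```python
import numpy as np

# the shape heap is just an internal integer NDArray owned by the VM
shape_heap = np.zeros(4, dtype="int64")

def store_shape(shape, heap, *indices):
    # hypothetical stand-in for the VM builtin `store_shape`
    for i, dim in zip(indices, shape):
        heap[i] = dim

A = np.empty((3, 5), dtype="float32")   # at runtime, shape_of(A) yields the concrete (m, n)
store_shape(A.shape, shape_heap, 0, 1)  # m -> heap[0], n -> heap[1]

# the shape function emitted by the shape lowering pass would do this step in TIR
shape_heap[2] = shape_heap[0] + 1       # j = m + 1
shape_heap[3] = shape_heap[1] + 1       # k = n + 1
```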





[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950883626


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)

Review Comment:
   In Relax, we decouple the optimization passes from the build system (we call it [minimum build](https://github.com/YuchenJin/tvm-rfcs/blob/relax-upstream-rfc/rfcs/0089-relax-upstreaming.md#43-relax-minimum-compilation-flow)). The goal of this decoupling is to enable flexible and customizable compilation pipelines. For example in MetaSchedule tuning (which will be an optimization pass in Relax), the tuning system can call `vm.build` to generate the executable and evaluate the run time of a certain schedule.
   
   We ensure each pass is a `IRModule` → `IRModule` transformation, and user can call `vm.build` at any point of their customized pass(es). I believe we will also have a `vm.compile` api which incorporates a series of default optimization passes and the minimum build.





[GitHub] [tvm-rfcs] leandron commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
leandron commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950339364


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+

Review Comment:
   Thanks for the points @slyubomirsky @Hzfengsy and @sunggg. I won't have the time to reply to all comments today, but wanted to come back on this thread.
   
   > IMHO, joint-optimization across multiple graph tuners (e.g., TASO+Collage) would be practically impossible.
   
   @sunggg I know what Collage is, but I'm not familiar with what TASO is in TVM. Can you give a little context about it and the expected use case involving TASO and Collage?
   
   > If we want to add it to relay, we should:
   > - add new AST nodes
   > - rewrite the build pipeline (i.e. we should first include tir function to the Module for further analysis)
   > - rewrite most of the passes (i.e. we need to do optimization with a Module that has both tir and relay function)
   > 
   > Based on that, making a new IR may be more reasonable.
   
   @Hzfengsy I understand your point, but at the same time, by adding a new IR that plans to deprecate the existing IR with planned overlapping features, are we not basically doing the same things as pointed out in your list, just from a new-IR perspective, which impacts - in the long term - all existing ASTs, pipelines and passes?
   
   @YuchenJin, given the complexity of this RFC, I think it would be good to have it amended with a "Rationale and Alternatives" section, similar to what we have in the template: https://github.com/apache/tvm-rfcs/blob/main/0000-template.md#rationale-and-alternatives





[GitHub] [tvm-rfcs] Hzfengsy commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Hzfengsy commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950160715


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
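
As a toy illustration of the convention (not taken from the RFC itself), a NumPy-style DPS kernel and its caller might look like the sketch below; the function name and shapes are made up:

```python
import numpy as np

def add_dps(a: np.ndarray, b: np.ndarray, out: np.ndarray) -> None:
    # the result is written into the caller-provided buffer; nothing is returned
    out[:] = a + b

a = np.ones(4, dtype="float32")
b = np.ones(4, dtype="float32")
out = np.empty(4, dtype="float32")  # allocation is handled by the caller (e.g. the compiler)
add_dps(a, b, out)
```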
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar to represent the above AST node.
+
+### **call_dps_packed**
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way), so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to be able to cover general cases that may not fall into this category:
+
+- C0: Dynamic shape relations where the output shape is data dependent on the input (e.g. `unique` operator).
+- C1: The rank of a tensor is not known (can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases, e.g. opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation for which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land back to the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that have weaker shape information.
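
Putting the pieces together, a fragment in the same TVMScript style as the snippets above could look like the following sketch; the dispatch to `"torch.unique"` mirrors the example mentioned above, and `k` is a fresh symbolic variable introduced by `match_shape`:

```python
# sketch of a fragment inside a Relax function body
y = R.call_packed("torch.unique", x)   # opaque call; y's shape is unknown at compile time
y1 = R.match_shape(y, (k,))            # k is populated at runtime; y1 is shape-aware again
gv = R.exp(y1)                         # downstream ops can rely on the refined shape (k,)
```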
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output with `R.output`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most of the optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to the dataflow-block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
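
As a rough sketch of what such a pass could look like over the AST classes defined in Section 4.1 below, the skeleton here rewrites only the pure bindings inside `DataflowBlock`s and leaves effectful blocks untouched; the constructor signatures are assumptions, since the RFC only specifies the fields:

```python
from typing import Callable, List

def rewrite_dataflow_blocks(func: "Function",
                            rewrite_binding: Callable[["Binding"], "Binding"]) -> "Function":
    """Apply `rewrite_binding` to every binding inside DataflowBlocks only."""
    # assumes func.body is a SeqExpr, matching the function structure in Section 4.1
    new_blocks: List["BindingBlock"] = []
    for block in func.body.blocks:
        if isinstance(block, DataflowBlock):
            # safe to rewrite: these bindings are side-effect-free and control-flow free
            new_blocks.append(DataflowBlock([rewrite_binding(b) for b in block.bindings]))
        else:
            # anything outside a dataflow block may have side effects; leave it as-is
            new_blocks.append(block)
    return Function(func.params, SeqExpr(new_blocks, func.body.body), func.ret_type)
```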
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST, while reusing `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, while a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope such as an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as a normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call** instruction (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** instruction to get the shape of a tensor).
+- Shape calculations are done via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new optimization spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("int32"), T.var("int32")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = R.call_packed("load_shape", shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+ 
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also has symbolic shape as first class (D1 in Section 3), Relax can directly integrate with TE and TOPI library.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE.  The left side of the above diagram uses `relax.BlockBuilder` API to incrementally build the IRModule as shown in TVMScript on the right. 
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Schedules the TE/TOPI function (`topi.matmul` in this case) using those `te.placeholder`.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean given that Relax has symbolic shape as first class and the support for `call_tir` for cross-layer interactions.
+
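A rough code sketch of the flow depicted above: `relax.BlockBuilder` and `emit_te` are named in the text, while the `function`/`dataflow` scoping helpers, `emit_output`/`emit_func_output`, and the exact `relax.Var` constructor are assumptions about the API surface rather than definitions made by this RFC:

```python
import tvm
from tvm import relax, topi, tir

bb = relax.BlockBuilder()

# symbolic dimensions and the input/weight variables of the Relax function
n, m = tir.Var("n", "int64"), tir.Var("m", "int64")
x = relax.Var("x", [n, m], relax.DynTensorType(2, "float32"))
w = relax.Var("w", [m, n], relax.DynTensorType(2, "float32"))

with bb.function("main", [x, w]):
    with bb.dataflow():
        # creates a TIR PrimFunc from topi.matmul and emits a call_tir to it
        lv0 = bb.emit_te(topi.matmul, x, w)
        gv0 = bb.emit_output(lv0)
    bb.emit_func_output(gv0)

mod = bb.get()   # an IRModule containing the Relax "main" plus the generated PrimFunc
mod.show()
```
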
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` calls into these generated TIR PrimFuncs.
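
A usage sketch of the translator, assuming it is exposed as a helper along the lines of `relay_translator.from_relay` (the exact import path and signature are assumptions, not something fixed by this RFC):

```python
import tvm
from tvm import relay
from tvm.relax.testing import relay_translator  # assumed location of the translator

# onnx_model and shape_dict are placeholders for an actual model and its input shapes
relay_mod, params = relay.frontend.from_onnx(onnx_model, shape=shape_dict)

# lowers Relay ops via their TOPI strategies and emits call_tir into a Relax "main"
relax_mod = relay_translator.from_relay(relay_mod["main"])
```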
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution
+
+# 5. **Future work**
+
+This RFC only focuses on the foundation part of Relax. After it, we will incrementally incorporate the additional capabilities and features. Relax aims to achieve parity with the functionality provided by Relay: this means that workloads which are functional on Relay will also be functional on Relax, even though the infrastructure underneath may change.
+
+Future plans that we will bring in future RFCs:
+
+- AOT: AOT compilation has a wide range of benefits such as being more space efficient and is necessary for resource-constrained projects like uTVM. We are committed to continuously supporting the AOT compilation in Relax, and there is an ongoing effort to connect Relax to the current AOT executor.
+- BYOC: We will try to reuse the existing translation spec. In Relax, BYOC can be formalized as an IRModule → IRModule pass that turns offloaded subgraphs into calls to external packed functions.

Review Comment:
   [BYOC](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#byoc-9) and [AOT](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#solidifying-aot-11) are explained here. It would be good to copy that content into this RFC.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
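+
+As a point of reference, the difference between the two conventions can be illustrated with a small, runnable NumPy sketch (plain Python, not TVM APIs): a non-DPS function allocates and returns its own output, while a DPS function writes into a caller-allocated destination.
+
+```python
+import numpy as np
+
+# Non-DPS: the callee allocates a fresh output array and returns it.
+def add_non_dps(a, b):
+    return a + b
+
+# DPS: the caller allocates the output and passes it in;
+# the callee only writes into the provided buffer.
+def add_dps(a, b, out):
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocation owned by the caller
+add_dps(a, b, out)
+assert np.array_equal(out, add_non_dps(a, b))
+```
+
+Because the caller owns every allocation in the DPS form, a compiler sitting above such calls is free to plan, pool, and reuse the underlying buffers.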
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan for the intermediate tensors and tie things together for effective reuse.
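+
+To illustrate the memory-plan remark above, here is a toy sketch (plain Python, not the actual Relax lowering or planner) of how a lowering pass could hand out storage from a pool keyed by byte size, so that tensors with disjoint lifetimes share the same block instead of each `call_tir` output getting a fresh allocation:
+
+```python
+import numpy as np
+
+class StoragePool:
+    """Toy allocator: reuse previously freed storage blocks of the same byte size."""
+    def __init__(self):
+        self.free_blocks = {}   # nbytes -> list of flat byte buffers
+        self.owner = {}         # id(tensor) -> underlying byte buffer
+
+    def alloc_tensor(self, shape, dtype):
+        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
+        blocks = self.free_blocks.setdefault(nbytes, [])
+        storage = blocks.pop() if blocks else np.empty(nbytes, dtype="uint8")
+        tensor = storage.view(dtype).reshape(shape)
+        self.owner[id(tensor)] = storage
+        return tensor
+
+    def free_tensor(self, tensor):
+        storage = self.owner.pop(id(tensor))
+        self.free_blocks[storage.nbytes].append(storage)
+        return storage
+
+pool = StoragePool()
+t0 = pool.alloc_tensor((4, 4), "float32")   # output of one lowered call_tir
+s0 = pool.free_tensor(t0)                   # t0 is dead after its last use
+t1 = pool.alloc_tensor((4, 4), "float32")   # a later call_tir of the same size...
+assert pool.owner[id(t1)] is s0             # ...reuses the same storage block
+```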
+
+`call_tir` is implemented as a special Relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
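+
+For reference, "registered in the environment via TVM FFI" refers to the existing TVM packed-function mechanism; the short snippet below (assuming a standard TVM installation, and independent of Relax) registers a non-DPS packed function and looks it up by name, which is what a lowered `call_packed` amounts to at runtime:
+
+```python
+import tvm
+import numpy as np
+
+@tvm.register_func("my_packed_func")
+def my_packed_func(a):
+    # a non-DPS packed function: it allocates and returns its own output
+    return tvm.nd.array(a.numpy() + 1.0)
+
+f = tvm.get_global_func("my_packed_func")            # lookup through TVM FFI
+x = tvm.nd.array(np.zeros((2, 2), dtype="float32"))
+print(f(x))
+```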
+
+### **call_dps_packed**
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by `call_tir` and `call_dps_packed` requires that the shapes of the output tensors be computed before the call. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
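+
+As a small, runnable illustration (plain NumPy, not Relax itself) of why shape computation must come first under DPS: the output buffer has to be allocated before the kernel runs, so a shape function is invoked ahead of the operator itself.
+
+```python
+import numpy as np
+
+def flatten_shape_func(in_shape):
+    # shape function: computes the output shape (n * m,) from the input shape (n, m)
+    n, m = in_shape
+    return (n * m,)
+
+def flatten_dps(x, out):
+    # DPS kernel: writes the flattened data into the caller-allocated buffer
+    out[:] = x.reshape(-1)
+
+x = np.arange(6, dtype="float32").reshape(2, 3)             # n=2, m=3 known only at runtime
+out = np.empty(flatten_shape_func(x.shape), dtype=x.dtype)  # compute the shape, then allocate
+flatten_dps(x, out)
+```
+
+For a data-dependent operator such as `unique`, no such shape function exists, which is exactly the case the safety net described later in this section is designed for.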
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It is `nullptr` when the expression's `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR's point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land, where types depend on values (shapes in our case), which requires heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The `match_shape` construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics (a small sketch of the matching behavior follows the list):
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
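+
+A minimal sketch (plain Python, not the actual Relax implementation) of the populate-on-first-use, check-afterwards behavior described above, with symbolic variables modeled as strings bound in a scope dictionary:
+
+```python
+def match_shape(shape, pattern, scope):
+    """Toy model of match_shape: bind unbound symbolic vars, check bound ones."""
+    assert len(shape) == len(pattern)
+    for dim, sym in zip(shape, pattern):
+        if isinstance(sym, str):              # a symbolic variable such as "m"
+            if sym not in scope:
+                scope[sym] = dim              # first occurrence: populate the variable
+            elif scope[sym] != dim:
+                raise ValueError(f"mismatch: {sym}={scope[sym]} vs {dim}")
+        elif sym != dim:                      # a concrete dimension must match exactly
+            raise ValueError(f"mismatch: expected {sym}, got {dim}")
+    return shape
+
+scope = {}
+match_shape((6,), ("m",), scope)   # binds m = 6, like R.match_shape(lv5, (m,))
+match_shape((6,), ("m",), scope)   # re-matching against the already-bound m succeeds
+print(scope)                       # {'m': 6}
+```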
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor's shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring such tensors from the shape-unaware land back to the shape-aware land using `match_shape`. Skipping shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces to external libraries that carry weaker shape information.
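+
+To make the "allocate and return" behavior concrete, here is a NumPy sketch (not TVM code) of a non-DPS `unique`-style packed function: the callee must allocate its own output because the result size is only known after inspecting the data, and the caller recovers shape information afterwards in the spirit of `match_shape`:
+
+```python
+import numpy as np
+
+def unique_packed(x):
+    # non-DPS: the output size depends on the data, so the callee allocates it
+    return np.unique(x)
+
+x = np.array([1, 2, 2, 3, 3, 3], dtype="float32")
+y = unique_packed(x)                   # shape unknown until runtime: here (3,)
+m = y.shape[0]                         # analogous to R.match_shape(y, (m,)): m is now bound
+out = np.empty((m,), dtype="float32")  # subsequent DPS calls can pre-allocate again
+np.exp(y, out=out)                     # e.g. the R.exp(lv6) step from the earlier example
+```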
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect-free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing an in-place weight update during model training) and control flow. The program below is an example that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as "internal nodes" of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can't be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime, similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call** **instruction** (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details; a small simulation of this mechanism is sketched right below.
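+
+Below is a tiny, self-contained Python simulation (not the actual VM builtins) of the shape-heap idea: shapes live in a flat integer array, shape arithmetic is a function over that array, and shapes are reconstructed from heap indices.
+
+```python
+import numpy as np
+
+shape_heap = np.zeros(4, dtype="int64")   # the shape heap is just a flat integer NDArray
+
+def store_shape(shape, heap, *indices):
+    for idx, dim in zip(indices, shape):
+        heap[idx] = dim
+
+def load_shape(heap, *indices):
+    return tuple(int(heap[i]) for i in indices)
+
+def shape_compute(heap):
+    # stands in for the TIR PrimFunc generated by the shape lowering pass:
+    # (j, k) = (m + 1, n + 1), reading slots 0/1 and writing slots 2/3
+    heap[2] = heap[0] + 1
+    heap[3] = heap[1] + 1
+
+A = np.zeros((5, 7), dtype="float32")     # m=5, n=7 only known at runtime
+store_shape(A.shape, shape_heap, 0, 1)
+shape_compute(shape_heap)
+assert load_shape(shape_heap, 2, 3) == (6, 8)
+```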
+
+As future work, we plan to consolidate the Relay VM and the Relax VM, and to integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining part.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling packed functions directly → we need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → we need to use a heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effects
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime (a small sketch of the storage/tensor split follows the list below):
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
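+
+As an illustration of the storage/tensor split (again a NumPy sketch, not the actual runtime builtins): a storage is an untyped, contiguous block of bytes, and a tensor is a typed view into it at a given offset, so several tensors can live in one storage.
+
+```python
+import numpy as np
+
+def alloc_storage(size_bytes):
+    # a storage is an untyped, contiguous block of memory
+    return np.empty(size_bytes, dtype="uint8")
+
+def alloc_tensor(storage, shape, offset, dtype):
+    # a tensor is a typed view into the storage, starting at `offset` bytes
+    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
+    return storage[offset:offset + nbytes].view(dtype).reshape(shape)
+
+storage0 = alloc_storage(2 * 4 * 3 * 4)              # room for two 4x3 float32 tensors
+t0 = alloc_tensor(storage0, (4, 3), 0, "float32")
+t1 = alloc_tensor(storage0, (4, 3), 48, "float32")   # second tensor at a 48-byte offset
+t0[:] = 1.0
+t1[:] = 2.0                                          # both views share storage0
+```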
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocations have side effects, so they are now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k)
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also treats symbolic shapes as first class (D1 in Section 3), it can directly integrate with the TE and TOPI libraries.
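+
+As a concrete reference point, the TE computation above can already be turned into a TIR PrimFunc with the existing `te.create_prim_func` API, which is the same entry point `emit_te` calls into (as described below); this is a plain TVM/TE sketch that assumes a standard TVM installation:
+
+```python
+import tvm
+from tvm import te
+
+n = te.var("n")
+A = te.placeholder((n,), name="a")
+B = te.placeholder((n,), name="b")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="c")
+
+# Create a TIR PrimFunc from the TE computation; the symbolic n is preserved.
+prim_func = te.create_prim_func([A, B, C])
+print(prim_func.script())   # print the PrimFunc as TVMScript
+```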
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the above diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule, as shown in TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Calls the TE/TOPI function (`topi.matmul` in this case) with those `te.placeholder`s to build the TE computation.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean, given that Relax treats symbolic shapes as first class and supports `call_tir` for cross-layer interaction.
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` calls into these generated TIR PrimFuncs.
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution
+
+# 5. **Future work**
+
+This RFC focuses only on the foundational part of Relax. After it lands, we will incrementally incorporate additional capabilities and features. Relax aims to achieve parity with the functionality provided by Relay: workloads that are functional on Relay will also be functional on Relax, even though the infrastructure underneath may change.
+
+Future plans that we will bring in future RFCs:
+
+- AOT: AOT compilation has a wide range of benefits, such as being more space-efficient, and it is necessary for resource-constrained projects like uTVM. We are committed to continuously supporting AOT compilation in Relax, and there is an ongoing effort to connect Relax to the current AOT executor.
+- BYOC: We will try to reuse the existing translation spec. In Relax, BYOC can be formalized as a pass that calls external packed functions.

Review Comment:
   [BYOC](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#byoc-9) and [AOT](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#solidifying-aot-11) are explained here. It would be good to copy that content into this RFC



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] Hzfengsy commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Hzfengsy commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950144597


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   I'd like to explain more about the **Runtime performance**
   
   For a single operator or a single subgraph, Relax uses MetaSchedule, the same as Relay does. However, Relax gives us an opportunity to break the boundary between graph-level ops and low-level code and enables optimization across layers - for example, **layout rewrite** (as I mentioned at https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344/4).
   
   @jinhongyii has finished a prototype for automatic (data and weight) layout rewrite, which shows about a 10% speedup. 
   
   However, Relay has no way to support this, because:
   1. Relay defines the layout at the graph level and cannot get feedback from the low-level operators.
   2. We would have to register every possible layout on the Relay operator, which is not feasible. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r980379409


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+

Review Comment:
   Thanks @leandron for the suggestion! We have added the [Rationale and Alternatives section](https://github.com/YuchenJin/tvm-rfcs/blob/relax-upstream-rfc/rfcs/0089-relax-upstreaming.md#6-rationale-and-alternatives) to the RFC, please take a look. :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1735749718

   Thanks @FrozenGene for bringing this up! To bring broader awareness, we posted a new strategy proposal here https://discuss.tvm.apache.org/t/discuss-tvm-core-strategy-for-emerging-needs/15751 to concretely enable LLMs and other use cases


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] junrushao commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
junrushao commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1381139517

   My position:
   - Relay and Relax are going to co-exist as parallel submodules in TVM, and one should not affect the other at all;
   - Committed to keeping Relay source code in "main" in the foreseeable future without hinting about potential deprecation;
   - Having Relax in "main" >>> having Relax in a separate branch > not having Relax at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r951902969


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### call_dps_packed
+
+Many low-level library functions (e.g., in TensorRT) are designed in DPS. To be able to call into such a DPS packed function, and hence let the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b.copyfrom(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g., the `unique` operation, which selects the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case), which requires heavier machinery to handle. 
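+
+To make the separation concrete, here is a minimal plain-Python model of this design (an illustrative sketch only, not the actual Relax classes): rank and dtype live in the type, while the shape is a separate value-level field on the expression.
+
+```python
+from dataclasses import dataclass
+from typing import Any, Optional
+
+@dataclass
+class DynTensorTypeSketch:
+    ndim: int    # -1 means unknown rank
+    dtype: str   # "" means unknown dtype
+
+@dataclass
+class ExprSketch:
+    checked_type: DynTensorTypeSketch
+    shape: Optional[Any]  # stand-in for ShapeExpr / generic Expr / RuntimeDepShape
+
+# lv0: R.Tensor[(n, 4), "float32"]
+lv0 = ExprSketch(DynTensorTypeSketch(ndim=2, dtype="float32"), shape=("n", 4))
+# The type only records "2-D float32 tensor"; (n, 4) is value-level information.
+```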

Review Comment:
   I think it could make sense to include some broad notes about how typing is intended to work in Relax. Overall, the type system is simpler than in Relay: Tensor types track only the rank and dtype of tensors (both optional) and shape is approached separately from type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950885049


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handles, cuDNN contexts).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring such tensors from the no-shape-computation land back to the shape-aware land using match_shape. Skipping shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that provide weaker shape information.
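+
+The snippet below sketches this recovery pattern in the same TVMScript notation as the examples above (`exp_func` is a hypothetical PrimFunc in the same IRModule, used only for illustration):
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def refine_after_unique(x: R.Tensor[(n,), "float32"]):
+    # The output shape of unique is data dependent (RuntimeDepShape).
+    lv0: R.Tensor[_, "float32"] = R.unique(x)
+    # Recover a symbolic shape (m,), defining m on its first use.
+    lv1: R.Tensor[(m,), "float32"] = R.match_shape(lv0, (m,))
+    # Downstream calls can now be planned in destination-passing style.
+    gv0 = R.call_tir(exp_func, (lv1,), (m,), dtype="float32")
+    return gv0
+```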
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its outputs, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output`.
+
+In the above relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on the Relax runtime. Other components in the TVM stack such as TIR, TOPI, and TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, as shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap (see the sketch below). Please refer to the shape lowering pass in the next subsection for more details.
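+
+A minimal, self-contained model of this shape-heap idea (plain NumPy standing in for the VM builtins; the names are illustrative, not the actual builtin signatures):
+
+```python
+import numpy as np
+
+# The shape heap is a flat integer NDArray owned by the VM.
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(heap, tensor_shape, indices):
+    # Model of the builtin that copies a tensor's shape into heap slots.
+    for idx, dim in zip(indices, tensor_shape):
+        heap[idx] = dim
+
+def shape_func(heap):
+    # Model of the TIR shape function generated by the shape lowering pass:
+    # (j, k) = (m + 1, n + 1), reading slots 0/1 and writing slots 2/3.
+    heap[2] = heap[0] + 1
+    heap[3] = heap[1] + 1
+
+A = np.zeros((5, 7), dtype="float32")
+store_shape(shape_heap, A.shape, indices=(0, 1))
+shape_func(shape_heap)
+# shape_heap is now [5, 7, 6, 8]; slots 2 and 3 hold (j, k).
+```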
+
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
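+
+To illustrate the decoupling, a customized pipeline is simply a sequence of `IRModule` → `IRModule` transformations followed by the minimum build. The sketch below uses a deliberately trivial placeholder pass (`identity_pass` and `compile_module` are illustrative names, not part of this RFC):
+
+```python
+import tvm
+from tvm import relax
+
+@tvm.transform.module_pass(opt_level=0)
+def identity_pass(mod, ctx):
+    # Any IRModule -> IRModule transformation can be slotted in here; real
+    # pipelines would chain fusion, layout, BYOC offloading, etc. the same way.
+    return mod
+
+def compile_module(mod: tvm.IRModule):
+    # `mod` can be any valid Relax IRModule, e.g. MyIRModule from the running example below.
+    mod = identity_pass(mod)                                # composable passes
+    return relax.vm.build(mod, tvm.target.Target("llvm"))   # minimum build at the end
+```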
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(a: T.handle, b: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(a, (n1, m1))
+        Y = T.match_buffer(b, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### Address C0: lower `call_tir` to explicit memory allocation form
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
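+
+To make this concrete, here is a sketch (not normative; the exact argument conventions of the builtins may differ) of what the running example could look like after this lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # n and m are still symbolic here; they are lowered separately (see C1).
+    n, m = R.match_shape(x, (n, m))
+    storage0 = relax.runtime.builtin.alloc_storage(size=(n * m,), device=cpu)  # cpu: placeholder device
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, (n, m), offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)       # DPS call: writes into tensor0
+    storage1 = relax.runtime.builtin.alloc_storage(size=(m * n,), device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, (m * n,), offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+    relax.runtime.builtin.free_tensor(tensor0)
+    relax.runtime.builtin.free_storage(storage0)
+    return tensor1
+```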

Review Comment:
   We will certainly introduce some liveness analysis and memory planning passes to decide when storages/tensors should be freed.
   
   Thanks for the device placement pass suggestion; we will likely reuse it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950339476


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.

Review Comment:
   Thanks for the comment! Yes, it has been supported in Relay, but there are some nontrivial limitations, imo. 
   
   - (1) The Relay main pipeline lowers every Relay IR into TIR at once at the IR boundary. This makes partial lowering (lowering only part of the graph) difficult in the main pipeline. 
   - (2) The Relay main pipeline supports lowering with `OpStrategy`. However, it is not necessarily easy to customize it (custom lowering).
   
   For these reasons, people introduced `RelayToTIR` and `RelayToRuntime`, which essentially bypass the main pipeline. Although they enable the functionality people want, IMHO it might not be easy to maintain them as a framework, and it is not easy to leverage multiple lowering strategies in an incremental way. Therefore, Relax wants to tackle this problem and provide such support in an organized, systematic way. For example, since Relax provides a unified abstraction, we can introduce graph-IR-to-TIR transformations into the pipeline, and this is essentially what lowering does. Thus, by introducing such a mechanism as a Relax-to-TIR transformation pass, Relax can bring those functionalities into the main pipeline in a customizable manner. We expect users will be able to reuse most of the lowering machinery, since most of the time you just want to change the "how to lower" part. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1285757823

   There were concerns brought up in [RFC #95](https://github.com/apache/tvm-rfcs/pull/95) that this RFC conversation did not cover "how the proposal fits into TVM". We agree that discussing the fit is important and would like to refer to related conversations and sections:
   
   - https://github.com/YuchenJin/blob/relax-upstream-rfc/rfcs/0089-relax-upstreaming.md#6-rationale-and-alternatives demonstrates that the design deeply aligns with TensorIR, TOPI, symbolic shapes, PackedFunc, and many other modules.
   
   - https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1267729342 discusses the fit and how Relax can address the end-to-end dynamic shape compilation problem.
   
   - https://github.com/tqchen/tvm-rfcs/blob/main/rfcs/0091-establish-tvm-unity-connection.md outlines the fit of the unified composable flow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1270640946

   For those interested, I think [this recent paper](https://arxiv.org/pdf/2210.02374.pdf) shows one way that symbolic shapes could be made to work with Relay's type checking approach (Axon is clearly structured very similarly to Relay), though it would require substantially reworking the existing type relations in Relay. It's rather different from Relax's approach, so it's a possible point of comparison.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1652153033

   Just another update and gentle reminder: it is great to see unity being developed and used for dynamic shape and emerging use cases.
   
   One goal of G1 is to give some time to answer questions. There are more topics of related interest (some may be related to the questions in this thread: https://discuss.tvm.apache.org/c/development/unity/14). Please check it out and feel free to participate in the discussions and technical questions.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [RFC] Relax Upstreaming [tvm-rfcs]

Posted by "slyubomirsky (via GitHub)" <gi...@apache.org>.
slyubomirsky commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1904942456

   It's worth noting that with the merging of Unity into TVM's main branch, Relax has already been _de facto_ upstreamed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950885777


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: Unified abstractions and optimizations across layers
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
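+
+As a small illustration (plain numpy, not part of the proposal), a DPS-style function fills in a caller-provided buffer instead of allocating its own result:
+
+```python
+import numpy as np
+
+def add_dps(a, b, out):
+    # The caller owns and allocates `out`; the callee only writes into it.
+    np.add(a, b, out=out)
+
+a = np.ones(4, dtype="float32")
+b = np.ones(4, dtype="float32")
+out = np.empty(4, dtype="float32")  # allocated by the higher-level framework
+add_dps(a, b, out)
+```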
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way), so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
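+
+To make the separation concrete, here is a small stand-in sketch (plain Python dataclasses, not the real TVM classes): type compatibility only looks at rank and dtype, while the shape travels as a separate value-level field.
+
+```python
+from dataclasses import dataclass
+from typing import Optional, Tuple
+
+@dataclass(frozen=True)
+class SketchDynTensorType:
+    ndim: int    # -1 means unknown rank
+    dtype: str   # "" means unknown dtype
+
+@dataclass
+class SketchTensorValue:
+    ty: SketchDynTensorType              # compile-time type: rank + dtype only
+    shape: Optional[Tuple[object, ...]]  # value-level info, may hold symbolic vars
+
+# lv0 from the example: its type is DynTensor(rank=2, "float32"), while (n, 4)
+# is carried separately as shape information.
+n = "n"  # stand-in for a symbolic shape variable
+lv0 = SketchTensorValue(SketchDynTensorType(ndim=2, dtype="float32"), (n, 4))
+assert lv0.ty == SketchDynTensorType(2, "float32")  # the type check ignores the shape
+```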
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`. (A small sketch of this matching rule follows the list.)
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
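+
+The Tensor case can be summarized with a small illustrative sketch (plain Python; the helper name is made up):
+
+```python
+def sketch_match_shape(runtime_shape, pattern, bindings):
+    """Populate first-occurrence symbolic vars, check everything else."""
+    assert len(runtime_shape) == len(pattern), "rank mismatch"
+    for dim, sym in zip(runtime_shape, pattern):
+        if isinstance(sym, int):       # static dimension: must match exactly
+            assert dim == sym
+        elif sym in bindings:          # already-bound symbolic var: must agree
+            assert bindings[sym] == dim
+        else:                          # first occurrence in the scope: populate
+            bindings[sym] = dim
+    return bindings
+
+env = {}
+sketch_match_shape((5, 4), ("m", 4), env)  # binds m = 5
+sketch_match_shape((5,), ("m",), env)      # consistent with the earlier binding
+```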
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land to the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libs that have weaker shape information.
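+
+For illustration, such a fallback could look like the sketch below (the name "my_unique" is made up for the example); the packed function simply allocates and returns a fresh tensor, and a later `match_shape` brings the result back into the shape-aware part of the program:
+
+```python
+import numpy as np
+import tvm
+
+@tvm.register_func("my_unique")
+def unique_packed(x):
+    # Allocate and return the result; no output shape is needed ahead of time.
+    return tvm.nd.array(np.unique(x.numpy()))
+```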
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
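+
+As a rough sketch of this division of labor (stand-in dataclasses, not the real AST nodes), a rewrite can simply skip any block that is not a dataflow block:
+
+```python
+from dataclasses import dataclass, field
+from typing import Callable, List
+
+@dataclass
+class SketchBindingBlock:          # mirrors BindingBlock from Section 4.1
+    bindings: List[str] = field(default_factory=list)
+
+@dataclass
+class SketchDataflowBlock(SketchBindingBlock):  # mirrors DataflowBlock
+    pass
+
+def rewrite_dataflow_only(blocks: List[SketchBindingBlock],
+                          rewrite: Callable[[str], str]) -> List[SketchBindingBlock]:
+    # Rewrite bindings only inside dataflow blocks; ordinary binding blocks may
+    # carry side effects or control flow, so they are left untouched here.
+    for block in blocks:
+        if isinstance(block, SketchDataflowBlock):
+            block.bindings = [rewrite(b) for b in block.bindings]
+    return blocks
+
+blocks = [SketchDataflowBlock(["a"]), SketchBindingBlock(["b"])]
+rewrite_dataflow_only(blocks, str.upper)  # only the dataflow block's binding is rewritten
+```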
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents matching a shape.
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape against the symbolic variables (m, n),
+    and binding a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*.
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr. (A small evaluation sketch follows this list.)
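+
+The evaluation rule above can be written out as a tiny sketch (dict-based stand-ins rather than real IR nodes):
+
+```python
+def sketch_eval_seq_expr(seq, env):
+    # Evaluate each binding block in order, extending the environment,
+    # then evaluate the body; the body's value is the SeqExpr's value.
+    for block in seq["blocks"]:
+        for name, value in block["bindings"]:
+            env[name] = value(env) if callable(value) else value
+    return env[seq["body"]]
+
+seq = {"blocks": [{"bindings": [("lv0", 2), ("gv0", lambda e: e["lv0"] * 3)]}],
+       "body": "gv0"}
+assert sketch_eval_seq_expr(seq, {}) == 6
+```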
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, and the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details. (A small numpy illustration of this shape-heap manipulation follows this list.)
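+
+The shape-heap manipulation above can be illustrated with plain numpy (indices and helper names are made up for the example):
+
+```python
+import numpy as np
+
+shape_heap = np.zeros((4,), dtype="int64")   # the VM's internal shape storage
+
+def store_shape(tensor_shape, indices):      # role of a builtin like store_shape
+    for idx, dim in zip(indices, tensor_shape):
+        shape_heap[idx] = dim
+
+def shape_func(heap):                        # role of the generated TIR shape PrimFunc
+    heap[2] = heap[0] + 1                    # j = m + 1
+    heap[3] = heap[1] + 1                    # k = n + 1
+
+store_shape((8, 16), indices=(0, 1))         # A's runtime shape (m, n) = (8, 16)
+shape_func(shape_heap)                       # shape_heap is now [8, 16, 9, 17]
+```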
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   The ultimate goal is to consolidate the Relay VM and the Relax VM, which requires work such as refactoring the Relay VM codegen. As a first step, we bring the new VM in, which already reuses a bunch of insights from the Relay VM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] masahi commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
masahi commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952460568


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: Unified abstractions and optimizations across layers
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
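+
+As a toy illustration of why this matters (plain numpy with made-up helpers, not compiler output), destination passing lets a planner decide that an intermediate buffer can be reused for a later output:
+
+```python
+import numpy as np
+
+def exp_dps(x, out):             # stand-ins for lowered PrimFuncs in DPS form
+    np.exp(x, out=out)
+
+def relu_dps(x, out):
+    np.maximum(x, 0.0, out=out)
+
+x = np.random.rand(4).astype("float32")
+tmp = np.empty_like(x)           # one buffer chosen by the (hypothetical) planner
+exp_dps(x, tmp)
+relu_dps(tmp, tmp)               # the same buffer is reused as relu's output
+```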
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way), so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
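+
+As a simple illustration of "compute the shape before calling the operator" (a host-side sketch with made-up names, not Relax code), the output shape of a matmul-style op can be derived from the input shapes and then used to pre-allocate the DPS output:
+
+```python
+def matmul_out_shape(x_shape, w_shape):
+    n, k = x_shape
+    k2, m = w_shape
+    assert k == k2, "inner dimensions must agree"
+    return (n, m)
+
+print(matmul_out_shape((4, 8), (8, 3)))  # -> (4, 3)
+```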
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land to the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libs that have weaker shape information.
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
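+
+To make these definitions concrete, the sketch below treats a handful of the node classes above as plain Python dataclasses and assembles the skeleton of a function whose body is a `SeqExpr` containing one `DataflowBlock`. These are simplified stand-ins for illustration only; the real nodes are C++-backed TVM objects with additional fields.
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+# Simplified stand-ins for the Relax AST nodes sketched above (illustration only).
+@dataclass
+class Var:
+    name: str                          # stands in for vid / type_annotation
+
+@dataclass
+class Call:
+    op: str                            # e.g. a GlobalVar or Op name
+    args: List[Var]
+
+@dataclass
+class VarBinding:
+    var: Var
+    value: Call
+
+@dataclass
+class DataflowBlock:
+    bindings: List[VarBinding]
+
+@dataclass
+class SeqExpr:
+    blocks: List[DataflowBlock]
+    body: Var
+
+@dataclass
+class Function:
+    params: List[Var]
+    body: SeqExpr
+
+x, w = Var("x"), Var("w")
+lv0, gv0 = Var("lv0"), Var("gv0")
+
+block = DataflowBlock(bindings=[
+    VarBinding(lv0, Call("relax.dot", [x, w])),
+    VarBinding(gv0, Call("relax.flatten", [lv0])),
+])
+func = Function(params=[x, w], body=SeqExpr(blocks=[block], body=gv0))
+print(func)
+```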
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks: `SeqExpr`, which usually serves as a `Function`’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) via calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap; a small Python sketch of this mechanism is given below. Please refer to the shape lowering pass in the next subsection for more details.
+
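+The following standalone sketch illustrates the shape-heap idea in plain Python with NumPy: shapes live in a flat integer buffer, and shape arithmetic becomes reads and writes at known indices. The `store_shape`/`load_shape` helpers here only mimic the VM builtins described above; they are not the actual runtime functions.
+
+```python
+import numpy as np
+
+# A toy "shape heap": a flat integer buffer owned by the VM (illustration only).
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(shape, heap, *indices):
+    # mimics relax.runtime.builtin.store_shape
+    for idx, dim in zip(indices, shape):
+        heap[idx] = dim
+
+def load_shape(heap, *indices):
+    # mimics relax.runtime.builtin.load_shape
+    return tuple(int(heap[idx]) for idx in indices)
+
+A_shape = (5, 7)                        # runtime value of (m, n) for tensor A
+store_shape(A_shape, shape_heap, 0, 1)
+
+# The shape computation (j, k) = (m + 1, n + 1); in Relax this would be a
+# TIR PrimFunc generated by the shape lowering pass.
+shape_heap[2] = shape_heap[0] + 1
+shape_heap[3] = shape_heap[1] + 1
+
+print(load_shape(shape_heap, 2, 3))     # (6, 8), i.e. (j, k)
+```
+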
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow users to explore the new design space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile a Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation (see the sketch after this list).
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
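+
+A minimal sketch of that `IRModule` → `IRModule` contract, using TVM's generic pass infrastructure (the concrete Relax lowering passes are not shown here), might look as follows:
+
+```python
+import tvm
+from tvm import IRModule
+
+# A (no-op) module pass: it takes an IRModule and returns an IRModule.
+@tvm.transform.module_pass(opt_level=0, name="MyCustomRewrite")
+def my_custom_rewrite(mod: IRModule, ctx) -> IRModule:
+    # Inspect or rewrite the functions in `mod` here; returning it unchanged is valid.
+    return mod
+
+# Because every stage has the same signature, passes compose into a pipeline.
+pipeline = tvm.transform.Sequential([my_custom_rewrite])
+# lowered_mod = pipeline(MyIRModule)   # apply to an IRModule such as the example below
+```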
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling a packed function directly → we need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = R.call_packed("load_shape", shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still tedious to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shape. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+ 
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also has symbolic shape as a first-class construct (D1 in Section 3), it can directly integrate with TE and the TOPI library.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the above diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule, as shown in TVMScript on the right.
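+
+Since the diagram above is an image, here is a rough textual sketch of that build flow. It assumes a `relax.BlockBuilder` API with `function`, `dataflow`, `emit_te`, `emit_output`, and `emit_func_output` helpers; the exact constructor signatures and helper names may differ from the final implementation.
+
+```python
+import tvm
+from tvm import relax, te, topi
+
+# Rough sketch (API names are assumptions based on the description below).
+bb = relax.BlockBuilder()
+
+n = te.var("n")
+x = relax.Var("x", [n, 128], relax.DynTensorType(2, "float32"))
+w = relax.Var("w", [128, 64], relax.DynTensorType(2, "float32"))
+
+with bb.function("linear", [x, w]):
+    with bb.dataflow():
+        # emit_te creates te.placeholders for x and w, builds the TE compute for
+        # topi.matmul, lowers it to a TIR PrimFunc in the same IRModule, and
+        # emits a call_tir to that PrimFunc.
+        lv0 = bb.emit_te(topi.matmul, x, w)
+        gv0 = bb.emit_output(lv0)
+    bb.emit_func_output(gv0)
+
+mod = bb.get()   # IRModule containing the Relax function and the generated PrimFunc
+mod.show()
+```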
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Schedules the TE/TOPI function (`topi.matmul` in this case) using those `te.placeholder`.

Review Comment:
   I think you meant "create TE compute", rather than schedule.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949630757


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free ****if: it only reads from its inputs and returns the result via its output, it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+

Review Comment:
   Yes, `Ref`s and a few operators like `dropout`. The main reason for `Ref` to exist is for AD and training, so I imagine compiler support for such features would improve if training becomes a more important usage for TVM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] ekalda commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
ekalda commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949982699


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.

Review Comment:
   We can already have Relay and TIR in the same `IRModule`, so considering the question of dynamic shape support orthogonal to it, what new functionality does `call_tir` add? 



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference so that we do not need to get into [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case), which requires heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but with its shape field updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
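+
+As a rough illustration of the matching rules above, the following plain-Python sketch mimics the runtime semantics (it is not the Relax implementation; `Sym` stands in for a TIR symbolic variable and `env` for the enclosing scope):
+
+```python
+class Sym:
+    """A stand-in for a symbolic integer variable such as m or n."""
+    def __init__(self, name):
+        self.name = name
+
+def match_shape(shape, pattern, env):
+    """Match a concrete shape (tuple of ints) against a pattern of ints and Syms."""
+    assert len(shape) == len(pattern), "rank mismatch"
+    for dim, pat in zip(shape, pattern):
+        if isinstance(pat, Sym):
+            if pat.name not in env:
+                env[pat.name] = dim        # first occurrence: define the symbol
+            elif env[pat.name] != dim:     # later occurrences must stay consistent
+                raise ValueError(f"inconsistent binding for {pat.name}")
+        elif pat != dim:
+            raise ValueError("constant dimension mismatch")
+    return shape
+
+env = {}
+m, n = Sym("m"), Sym("n")
+match_shape((3, 4), (m, n), env)   # defines m=3, n=4
+match_shape((3,), (m,), env)       # consistent with the earlier binding of m
+print(env)                         # {'m': 3, 'n': 4}
+```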
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. a PRNG handle or cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation for which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from this shape-unaware land back into the shape-aware land using match_shape. Computing without shape information is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that have weaker shape information.
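+
+The contrast between the two styles can be sketched with a NumPy analogy (this is not Relax code): a data-dependent result has to be allocated by the callee and returned, whereas shape-known calls can stay in destination-passing style:
+
+```python
+import numpy as np
+
+def unique_alloc_on_return(x):
+    # the result's shape depends on the data in x, so the callee allocates it
+    return np.unique(x)
+
+def add_dps(x, y, out):
+    # destination-passing style: the caller pre-allocates `out` from known shapes
+    np.add(x, y, out=out)
+
+x = np.array([1, 2, 2, 3], dtype="float32")
+y = unique_alloc_on_return(x)             # shape (3,) only known after running unique
+out = np.empty(x.shape, dtype="float32")
+add_dps(x, x, out)                        # shapes known up front, so DPS works
+```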
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight updates during model training) and control flow. The program below contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been explicitly output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow blocks. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to work at the dataflow block level. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
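+
+As a toy illustration of what the dataflow-block guarantee buys pass writers, the sketch below (plain Python, not the Relax pass infrastructure; the classes are made up for this example) drops unused bindings from a block, which is only safe because bindings inside the block are side-effect free:
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Binding:
+    var: str
+    op: str
+    args: list
+
+@dataclass
+class DataflowBlock:
+    bindings: list
+    outputs: set
+
+def dead_code_eliminate(block: DataflowBlock) -> DataflowBlock:
+    live = set(block.outputs)
+    kept = []
+    for b in reversed(block.bindings):    # walk bottom-up
+        if b.var in live:                 # the binding feeds an output or a live binding
+            kept.append(b)
+            live.update(b.args)
+    return DataflowBlock(list(reversed(kept)), block.outputs)
+
+block = DataflowBlock(
+    bindings=[
+        Binding("lv0", "call_tir:linear", ["x", "w", "b"]),
+        Binding("lv1", "call_tir:exp", ["lv0"]),   # never used: safe to drop
+        Binding("gv0", "call_tir:relu", ["lv0"]),
+    ],
+    outputs={"gv0"},
+)
+print([b.var for b in dead_code_eliminate(block).bindings])  # ['lv0', 'gv0']
+```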
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the resulting executable.

Review Comment:
   Out of interest, can you give an example of where this kind of workflow is applicable? Are the users supposed to write their models directly in Relax? Where should the weights data come from?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. We first introduce what the user-facing interface will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import numpy as np
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+data = tvm.nd.array(np.random.rand(2, 3).astype(np.float32))
+weight = tvm.nd.array(np.random.rand(3, 4).astype(np.float32))
+res = vm["relax_func"](data, weight)
+```
+
+## D0: Unified abstractions and optimizations across layers
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    tir_primfunc(*inputs, out_tensor)  # invoke the DPS PrimFunc with the pre-allocated output
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan for the intermediate tensors and tie things together for effective reuse.
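+
+The memory-planning benefit can be illustrated with a tiny NumPy analogy (not TVM code): because the kernels are in DPS, the caller decides where results live and can reuse a single scratch buffer across calls:
+
+```python
+import numpy as np
+
+def mul_dps(a, b, out):   # stand-ins for lowered DPS kernels
+    np.multiply(a, b, out=out)
+
+def add_dps(a, b, out):
+    np.add(a, b, out=out)
+
+x = np.ones((4,), dtype="float32")
+scratch = np.empty_like(x)       # allocated once by the caller (the "compiler")
+mul_dps(x, x, scratch)
+add_dps(scratch, x, scratch)     # the reuse decision is made outside the kernels
+```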
+
+`call_tir` is implemented as a special relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
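+
+For reference, the FFI mechanism that `call_packed` builds on already exists in TVM today. The snippet below registers a Python function in the global registry and calls it back through `tvm.get_global_func`; the name `demo.add_one` is made up for this example:
+
+```python
+import numpy as np
+import tvm
+
+@tvm.register_func("demo.add_one")
+def add_one(arr):
+    # a PackedFunc that returns a new NDArray rather than writing into one
+    return tvm.nd.array(arr.numpy() + 1)
+
+f = tvm.get_global_func("demo.add_one")
+res = f(tvm.nd.array(np.zeros((2,), dtype="float32")))
+print(res.numpy())   # [1. 1.]
+```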
+
+### call_dps_packed
+
+Many low-level library functions (for example in TensorRT) are designed in DPS. To be able to call into such a DPS packed function, so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+import tvm
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b.copyfrom(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.

Review Comment:
   Can you elaborate more on this? We can already lower separate subgraphs using different strategies.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.

Review Comment:
   Some detail on the new BYOC would be appreciated as well - what does the Relax BYOC offer in terms of functionality that the current one doesn't?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the resulting executable.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on the Relax runtime. Other components in the TVM stack, such as TIR, TOPI, and TVM FFI, are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.

Review Comment:
   I second that - if Relax is a superset of Relay, why don't we extend Relay?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
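+For illustration, here is a minimal NumPy-based sketch of the same convention (this is not part of the RFC's API; it only contrasts the two calling styles, with the caller owning the output buffer in the DPS case):
+
+```python
+import numpy as np
+
+# Non-DPS: the callee allocates and returns the result.
+def add(a, b):
+    return a + b
+
+# DPS: the caller pre-allocates `out`; the callee only writes into it.
+def add_dps(a, b, out):
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocated by the caller (e.g. the compiler)
+add_dps(a, b, out)
+```
+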
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI.
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly manage the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where the output shape is data-dependent on the input (e.g. the `unique` operator).
+- C1: The rank of a tensor is not known (can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from this shape-unaware land back to the shape-aware land using match_shape (a short sketch follows below). Computing without shape information is by no means the most effective way to handle things, but it is necessary for cases like data-dependent computation and interfacing with external libraries that provide weaker shape information.
+
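+To make the safety net concrete, here is a minimal sketch mirroring the `shape_example` program above (the function and variable names are ours, not part of the RFC) that recovers shape information after a data-dependent call:
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def safety_net_example(x: R.Tensor[(n,), "float32"]):
+    # The shape of y is unknown at compile time (RuntimeDepShape).
+    y: R.Tensor[_, "float32"] = R.unique(x)
+    # match_shape brings y back into the shape-aware land by binding
+    # the symbolic variable m to y's runtime shape.
+    z: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+    # From here on, m participates in shape deduction as usual.
+    w: R.Tensor[(m,), "float32"] = R.exp(z)
+    return w
+```
+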
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well (a small example of a rewrite that this enables is sketched below). These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
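+As an illustration of why this scoping matters for pass writers, the sketch below shows a simplified variant of `main` above after a hypothetical fusion pass (the second dataflow block and the file read are omitted for brevity; `fused_linear_relu` is a name such a pass might generate, not an existing PrimFunc):
+
+```python
+@R.function
+def fused_example(x: R.Tensor((1, 784), "float32"),
+                  w: R.Tensor((128, 784), "float32"),
+                  b: R.Tensor((128,), "float32")):
+    with R.dataflow():
+        # Inside the block everything is pure, so a pass may freely
+        # reorder independent bindings or fuse linear + relu into a
+        # single generated PrimFunc.
+        gv0 = R.call_tir(fused_linear_relu, (x, w, b), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    # The effectful call below must stay after the block that produces
+    # gv0 and cannot be reordered past other effectful statements.
+    R.call_packed("custom_inplace_update", gv0)
+    return gv0
+```
+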
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the resulting executable.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module and run it on the Relax runtime. Other components in the TVM stack such as TIR, TOPI, and TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. At the same time, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`). A rough sketch of the corresponding AST follows the code.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
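+For readers who prefer to see the data structures directly, here is a rough pseudocode sketch of the AST this program corresponds to, using the classes from Section 4.1. Constructor syntax, type/shape annotations, and the exact encoding of the operator and packed-function calls are simplified for readability (operators are shown by their script-level names):
+
+```python
+Function(
+    params=[x, w],  # two Var nodes bound to the function parameters
+    body=SeqExpr(
+        blocks=[
+            DataflowBlock(bindings=[
+                VarBinding(var=lv0, value=Call(R.dot, [x, w])),       # lv0 is a DataflowVar
+                VarBinding(var=gv0, value=Call(R.flatten, [lv0])),    # gv0 escapes the block
+            ]),
+            BindingBlock(bindings=[
+                VarBinding(var=gv1, value=Call(ExternFunc("custom_inplace_update"), [gv0])),
+            ]),
+        ],
+        body=gv1,  # the SeqExpr evaluates to gv1
+    ),
+    ret_type=None,  # elided here
+)
+```
+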
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. A simplified sketch of this shape-heap manipulation is shown after this list. Please refer to the shape lowering pass in the next subsection for more details.
+
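+The following Python sketch only illustrates the shape-heap bookkeeping described above; it does not reflect the actual VM builtin signatures or the generated TIR code, and the index assignments are the ones used in the example:
+
+```python
+import numpy as np
+
+# The shape heap is conceptually a flat integer NDArray owned by the VM.
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(tensor_shape, indices):
+    # Emulates the VM builtin that writes a tensor's runtime shape
+    # into the given slots of the shape heap.
+    for idx, dim in zip(indices, tensor_shape):
+        shape_heap[idx] = dim
+
+def shape_func():
+    # Emulates the TIR PrimFunc generated by the shape lowering pass:
+    # it reads (m, n) from slots 0 and 1 and writes (j, k) = (m+1, n+1)
+    # into slots 2 and 3.
+    shape_heap[2] = shape_heap[0] + 1
+    shape_heap[3] = shape_heap[1] + 1
+
+A = np.zeros((5, 7), dtype="float32")   # a tensor whose shape is only known at runtime
+store_shape(A.shape, indices=(0, 1))
+shape_func()
+j, k = shape_heap[2], shape_heap[3]     # (6, 8), now available for output allocation
+```
+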
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   I don't know the VM executor very well, but the impression I get is that Relax VM won't be hugely different from Relay VM, so why don't we immediately consolidate the two, avoiding the awkward situation of having to maintain two very similar executors in the stack?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI.
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly manage the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
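+Assuming the build-and-run API shown in the user-facing interface section above, this module could be exercised end-to-end roughly as follows; the concrete shapes here are arbitrary and only bind the symbolic variables at runtime:
+
+```python
+import numpy as np
+import tvm
+from tvm import relax
+
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+vm = relax.VirtualMachine(exec, tvm.cpu())
+
+# m, n, k are only bound by the actual argument shapes at runtime.
+x = tvm.nd.array(np.random.rand(8, 4).astype(np.float32))   # (m, n) = (8, 4)
+y = tvm.nd.array(np.random.rand(4, 6).astype(np.float32))   # (n, k) = (4, 6)
+res = vm["relax_func"](x, y)                                 # result shape (m, k * 2) = (8, 12)
+```
+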
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where the output shape is data-dependent on the input (e.g. the `unique` operator).
+- C1: The rank of a tensor is not known (can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from this shape-unaware land back to the shape-aware land using match_shape. Computing without shape information is by no means the most effective way to handle things, but it is necessary for cases like data-dependent computation and interfacing with external libraries that provide weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the resulting executable.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module and run it on the Relax runtime. Other components in the TVM stack such as TIR, TOPI, and TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST

Review Comment:
   It would be good to have a separate RFC for each one of the new nodes presented here; it would make having discussions easier, considering how much discussion we got out of adding one IR node (`tir.constant`).



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
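+
+As a minimal, framework-free sketch (plain NumPy, not Relax or TIR code), a DPS-style kernel and its caller look like the following: the caller owns the allocation, and the callee only fills the buffer it is given.
+
+```python
+import numpy as np
+
+def add_dps(a, b, out):
+    # DPS kernel: writes into the caller-provided buffer and returns nothing
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")   # allocated by the caller / "framework"
+add_dps(a, b, out)                        # callee mutates `out` in place
+```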
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) while still letting the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed ahead of the call. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference: it keeps us out of [dependent typing](https://en.wikipedia.org/wiki/Dependent_type), where types depend on values (shapes in our case) and would require much heavier machinery to handle.
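+
+To make the separation concrete, below is a small self-contained sketch (plain Python mirroring the `DynTensorType` fields above; not the actual type-inference code): two expressions with different symbolic shapes still carry the same type, because the shape lives in `shape_` rather than in the type.
+
+```python
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class DynTensorType:
+    ndim: int     # -1 means unknown rank
+    dtype: str    # empty string stands in for DataType::Void() (unknown dtype)
+
+# lv0 carries shape_ = (n, 4); some other tensor may carry shape_ = (m, k)
+t_lv0 = DynTensorType(ndim=2, dtype="float32")
+t_other = DynTensorType(ndim=2, dtype="float32")
+
+# Same type even though the attached shape_ expressions differ, so type
+# checking never has to reason about concrete or symbolic shape values.
+assert t_lv0 == t_other
+```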
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to handle general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for example the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware world back to the shape-aware world using match_shape. Skipping shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that provide weaker shape information.
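+
+As a hedged illustration of this safety net (the `"torch.unique"` name follows the example above; the exact lowering is up to the pass), the data-dependent call stays an opaque packed call that allocates its own result, and `match_shape` brings that result back into the shape-aware world:
+
+```python
+# Before lowering: y's shape is RuntimeDepShape (written as _)
+#   y: R.Tensor[_, "float32"] = R.unique(x)
+# After lowering (sketch): an allocate-and-return packed call, then a re-match
+y = R.call_packed("torch.unique", x)   # returns a freshly allocated tensor
+y1 = R.match_shape(y, (k,))            # defines symbolic k for downstream passes
+gv = R.call_tir(tir_exp_func, (y1,), (k,), dtype="float32")
+```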
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
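+
+To illustrate why this scoping helps pass writers, here is a minimal self-contained sketch in plain Python (toy stand-in classes, not the Relax pass infrastructure): a rewrite is applied only to pure dataflow blocks, while blocks that may contain side effects are left untouched.
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+@dataclass
+class BindingBlock:                 # may contain side effects; must be preserved as-is
+    bindings: List[str]
+
+@dataclass
+class DataflowBlock(BindingBlock):  # pure: safe to reorder, fuse, or eliminate bindings
+    pass
+
+def optimize(blocks: List[BindingBlock]) -> List[BindingBlock]:
+    out = []
+    for block in blocks:
+        if isinstance(block, DataflowBlock):
+            # stand-in for a real graph rewrite (fusion, CSE, reordering, ...)
+            out.append(DataflowBlock(sorted(block.bindings)))
+        else:
+            out.append(block)       # do not touch potentially effectful bindings
+    return out
+
+print(optimize([DataflowBlock(["gv0 = relu(lv0)", "lv0 = linear(x)"]),
+                BindingBlock(["call_packed('custom_inplace_update', gv0)"])]))
+```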
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, and the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap (which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details; a small standalone sketch of this bookkeeping follows below.
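+
+A small self-contained NumPy sketch of this bookkeeping (illustration only; the real VM uses builtin packed functions and a TIR shape function generated by the shape lowering pass):
+
+```python
+import numpy as np
+
+shape_heap = np.zeros(4, dtype="int64")    # the VM allocates this as an internal NDArray
+
+# store_shape(A.shape): A's runtime shape (m, n) goes into slots 0 and 1
+m, n = 5, 7
+shape_heap[0], shape_heap[1] = m, n
+
+# the generated shape PrimFunc computes (m + 1, n + 1) into slots 2 and 3
+shape_heap[2] = shape_heap[0] + 1
+shape_heap[3] = shape_heap[1] + 1
+
+# load_shape(heap, 2, 3) reconstructs the ShapeTuple (j, k)
+j, k = int(shape_heap[2]), int(shape_heap[3])
+assert (j, k) == (6, 8)
+```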
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow users to explore the new design space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("int32"), T.var("int32")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### Address C0: lower `call_tir` to explicit memory allocation form
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### Address C1: do shape lowering via VM heap manipulation
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k)
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = R.call_packed("load_shape", shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shape. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also has symbolic shape as first class (D1 in Section 3), Relax can directly integrate with TE and TOPI library.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the above diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule as shown in TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Invokes the TE/TOPI function (`topi.matmul` in this case) with those `te.placeholder`s to construct the TE compute.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean given that Relax has symbolic shape as first class and the support for `call_tir` for cross-layer interactions.
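+
+For reference, here is a hedged sketch of the `emit_te` flow on the left of the diagram (the exact `BlockBuilder` and `Var` constructor signatures below are approximations for illustration, not the definitive API):
+
+```python
+from tvm import relax, tir, topi
+
+bb = relax.BlockBuilder()
+n = tir.Var("n", "int64")
+# assumed Var constructor: (name hint, shape annotation, type annotation)
+x = relax.Var("x", (n, 128), relax.DynTensorType(2, "float32"))
+w = relax.Var("w", (128, 256), relax.DynTensorType(2, "float32"))
+
+with bb.function("main", [x, w]):
+    with bb.dataflow():
+        lv0 = bb.emit_te(topi.matmul, x, w)   # creates a TIR PrimFunc and a call_tir into it
+        gv0 = bb.emit_output(lv0)
+    bb.emit_func_output(gv0)
+
+mod = bb.get()   # IRModule containing the Relax "main" plus the generated PrimFunc
+mod.show()
+```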
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir`s into these generated TIR PrimFuncs.
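+
+A hedged usage sketch of the translator (the module path and function signature below are assumptions based on the description above, not a confirmed public API):
+
+```python
+# hypothetical location of the translator described above
+from tvm.relax.testing import relay_translator
+
+# relay_mod: any IRModule produced by a Relay frontend, e.g. relay.frontend.from_onnx(...)
+relax_mod = relay_translator.from_relay(relay_mod["main"], target="llvm")
+# relax_mod now contains a Relax `main` built from call_tir into generated TIR PrimFuncs
+```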
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:

Review Comment:
   Introducing a separate compiler into the source tree based on one RFC seems like too much in one go. I'd expect to see one RFC for each Relax feature.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950294616


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today’s and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.

Review Comment:
   In addition to @slyubomirsky and @Hzfengsy's great points, I will share my thoughts from the perspective of the optimization and compilation pipeline.
   
   Although it might have been possible, the interaction between graph IR and TensorIR/PackedFunc has been quite tricky in Relay world. This has caused significant difficulties and non-trivial engineering efforts, IMO. Here are some representative examples:
   
   - In Relay, there has been no convenient way to optimize graph IR by using the feedback from low-level.
     - If TensorIR performs layout transformation for a primfunc, its decision will affect other primfuncs as well. However, Relay cannot provide such feedback back to graph IR-level since two different IRs cannot co-exist.
     - Graph-level tuning methods (e.g., TASO, Collage) need a capability to apply a set of passes to the part of the graph, compile/measure its performance, and provide the performance number as a feedback back to Graph-IR level to generate better candidates. Although this could be achieved by nontrivial engineering efforts, it would complicate the compilation pipeline and maintenance efforts. IMHO, joint-optimization across multiple graph tuners (e.g., TASO+Collage) would be practically impossible. 
   - Lowering has to be done all at once at the boundary between Relay and TensorIR, and customizing lowering has been very challenging (e.g., partial/custom lowering).
       - The main pipeline with `OpStrategy` has not been easy to customize to lower only part of the graph for your own target, such as BYOC, while keeping other parts in high-level IR. Therefore, people had to figure out ways to bypass it and apply their own lowering mechanism (e.g., `RelayToTIR`) outside the main pipeline.
       - If you only want to apply certain schedule rules to part of the graph IR, you would need to lower those parts and apply the schedule rules to them. However, such freedom has not been allowed in the Relay main pipeline, so people had to find workarounds (e.g., use task extraction and find the primfunc among the extracted tasks; if extraction does not behave as users want, it requires extra engineering effort).
   
   Since Relax unifies abstraction, it can deliver those functionalities as compiler passes while providing flexibility and customizability. For example, since both high-level and low-level IRs co-exist, if TensorIR performs optimization decision that may have global effect, like layout transformation, we can rewrite the graph-level IR accordingly to express such change and consider its global implication. Also, lowering can be implemented as a RelaxIR->TensorIR transformation pass.  If you want to bring your own lowering mechanism, you can write a new pass. I expect you may be able to reuse most of the lowering machinery and only change the part about "how" you want to lower. 
   
   I would be happy to discuss further if you are interested in this direction. :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950506090


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the land of [dependent typing](https://en.wikipedia.org/wiki/Dependent_type), where types depend on values (shapes in our case) and heavier machinery is required to handle them.

Review Comment:
   I am working on language specifications, though I will defer to @YuchenJin as to whether they should be part of the RFC text. We actually have some unresolved debates on how it should work: https://github.com/tlc-pack/relax/issues/222



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] HLearning commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
HLearning commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1381231123

   need to update it quickly
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] Lyken17 commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Lyken17 commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1309546688

   I learn a lot from reading through the thread,  and find most people here are from a system background: either doing related research in schools or heading an engineering team in companies. I would like to share some of my thoughts from a different perspective, as a **TVM user** and **ML algorithm developer**. 
   
   I am a graduate student at MIT studying efficient deep learning algorithms and co-designs (details on [my page](http://lzhu.me/), [lab site](https://tinyml.mit.edu/), and [our recent project that trains NNs on a 256kB MCU](https://tinytraining.mit.edu/)). We have been honest TVM users because of its flexibility, high performance, and open-source nature. But when we want to dive deeper and make some customizations, things become complex and Relay is no longer friendly.
   
   * **Unnecessarily long call stack between Python and C++**: Take `relay.build` as an example: a Relay graph (in Python) first does shape checking (in C++), then calls into a wrapper (Python), later feeds into TensorExpression (either in Python or C++), and is then fed into the VM for compilation (packed functions). ANY step in the middle can raise errors, and developers can easily get lost in the pipeline. You can find many users reporting similar issues on the forum, and only a few of them are fortunate enough to get an answer from experienced developers.
   * **Difficult to add a new operator because of the complex pipeline**: In our research, and also in many other users' development, adding new operators is a common request. But in current Relay, even if we just want to add a simple Identity operator (y = x), we need to:
     1. declare an attribute node.
     2. write type relation check in CPP.
     3. register OP in CPP.
     4. describe the compute.
     5. describe the schedule.
     6. wrap up with CPP.
     7. wrap up with python.
     Seven steps just to define an identity function? Seriously? In PyTorch it would take no more than 20 lines. This significantly slows the growth of the TVM community, and if you check the [PR history](https://github.com/apache/tvm/commits/main/python/tvm/relay/op), the number of new operators and new contributors this year is quite limited, while PyTorch receives new operator implementations from the community every day.
   * **Missing capability to call third-party implementations**: Relay syntax does not, at least not easily, let users call 3rd-party backends like cuDNN, OpenVINO, or TensorRT. For the cloud, cuDNN and TensorRT are still SoTA on most benchmarks, and the lack of simple integration means inferior performance, which will make fewer people choose TVM. For the edge, the situation is even more serious because of hardware diversity. Take the Qualcomm DSP as an example: even though TVM Hexagon support is in progress, the best solution is still the manually written kernels in [SNPE](https://developer.qualcomm.com/sites/default/files/docs/snpe/overview.html). It is not trivial to call other backends in current Relay: BYOC is difficult to use, and registering custom operators can be quite complex, as discussed in the last point.
   
   I understand those who want backward compatibility so existing projects are not broken. But we cannot build a ship of Theseus in the real world, and the above issues cannot easily be "improved" within current Relay. If TVM does not embrace new designs and improve its user-friendliness, then eventually developers will switch to other tools, and this is indeed happening: 
   * [Oneflow uses MLIR to rewrite their compiler pass](https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion) to accelerate diffusion models by 4x compared with pytorch and 1.6x compared with TensorRT.
   * [Megvii adapts MLIR to minimize runtime build](https://github.com/MegEngine/MegCC) to generate YoloX binary with just 95kB.
   * [PyTorch proposes TorchDynamo to speedup training](https://github.com/pytorch/torchdynamo/) and achieves average 1.34x speedup over previous NVFuser. 
   * ... 
   
   I like the TVM project and hope the community will always stay active. TVM has a huge user base of researchers, and Relax would allow them to easily contribute their code and ideas to the repo, instead of resorting to tricky hacks and creating separate repos for each project. This is important for an open-source community -- just recall how MXNet lost its market and how PyTorch managed to beat TensorFlow even though it was released a year later. TVM should consider upstreaming Relax given its more thoughtful and user-friendly design, well-written documentation/tutorials, and the painless S0/S1/S2 upgrade path.
   
   I would be happy to discuss more if there are any comments or questions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] zhiics commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
zhiics commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r951022935


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
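+
+As a side illustration, here is a minimal NumPy sketch of the DPS convention (an illustration only, not a TVM API), where the caller allocates the destination and the callee writes into it:
+
+```python
+import numpy as np
+
+# DPS-style kernel: the inputs and the output buffer are all arguments;
+# the result is written in place and nothing is returned.
+def add_dps(a, b, out):
+    np.add(a, b, out=out)
+
+a = np.ones(4, dtype="float32")
+b = np.ones(4, dtype="float32")
+out = np.empty(4, dtype="float32")  # the caller (e.g. the compiler) allocates the destination
+add_dps(a, b, out)
+```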
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+Many low-level library functions (e.g. in TensorRT) are designed in DPS. To be able to call into a DPS packed function, so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
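+
+The following NumPy sketch (an illustration of the idea only, not a TVM API) shows this "compute the output shape first, then allocate and make the DPS call" pattern:
+
+```python
+import numpy as np
+
+def flatten_shape_func(in_shape):    # shape function: deduce the output shape from the input shape
+    return (in_shape[0] * in_shape[1],)
+
+def flatten_dps(x, out):             # DPS kernel: writes the result into the pre-allocated out
+    out[:] = x.reshape(-1)
+
+x = np.ones((3, 4), dtype="float32")
+out = np.empty(flatten_shape_func(x.shape), dtype="float32")
+flatten_dps(x, out)
+```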
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores how to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the three possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the land of [dependent typing](https://en.wikipedia.org/wiki/Dependent_type), where types depend on values (shapes in our case) and heavier machinery is required to handle them.
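+
+To make the split concrete, the two fields of `lv0` can be pictured as follows (illustrative spellings, not the exact API):
+
+```python
+# For lv0 above:
+#   lv0.checked_type_  ~  DynTensorType(ndim=2, dtype="float32")   # rank and dtype only
+#   lv0.shape_         ~  ShapeExpr([n, 4])                        # the shape is kept as a separate Expr
+```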
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land back to the shape-aware land using match_shape. Running without shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
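+
+A short sketch of this pattern, reusing only the constructs introduced above:
+
+```python
+# the shape of y is unknown at compile time (RuntimeDepShape)
+y: R.Tensor[_, "float32"] = R.unique(x)
+# re-enter the shape-aware land: k is defined here and y1 carries the refined shape (k,)
+y1: R.Tensor[(k,), "float32"] = R.match_shape(y, (k,))
+```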
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output; it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
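+
+For example, continuing the program above (a sketch of what the well-formedness rules allow):
+
+```python
+R.call_packed("custom_inplace_update", gv0)   # OK: gv0 was marked as an output of the block
+# R.call_packed("custom_inplace_update", lv0) # not well-formed: lv0 is local to the dataflow block
+```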
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
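+
+For instance, a typical dataflow-block-level rewrite is operator fusion. A hedged sketch, reusing the bindings from the program above (the fused PrimFunc name is hypothetical):
+
+```python
+# before the rewrite: two call_tir bindings inside a dataflow block
+lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+# after a hypothetical fusion pass: a single call_tir into a fused PrimFunc
+gv0 = R.call_tir(fused_linear_relu, (x, w, b), (1, 128), dtype="float32")
+```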
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call** instruction (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions invoked with the **Call** instruction to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap, as illustrated by the sketch below. Please refer to the shape lowering pass in the next subsection for more details.
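+
+A plain-Python sketch of the shape-heap idea (illustration only, not the actual VM implementation):
+
+```python
+import numpy as np
+
+shape_heap = np.zeros(4, dtype="int64")   # stand-in for the VM's internal shape heap NDArray
+
+def store_shape(shape):                   # analogous to the builtin that stores A's shape
+    shape_heap[0], shape_heap[1] = shape
+
+def shape_func():                         # analogous to the TIR PrimFunc emitted by shape lowering
+    shape_heap[2] = shape_heap[0] + 1     # j = m + 1
+    shape_heap[3] = shape_heap[1] + 1     # k = n + 1
+
+store_shape((8, 16))                      # suppose A is an 8 x 16 tensor at runtime
+shape_func()
+print(shape_heap[2:4])                    # [ 9 17] -> (j, k)
+```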
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effects
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it now lives in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)

Review Comment:
   Okay, I see. We'll still have `shape_of` to get the statically unknown shapes at runtime.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1226583468

   Having taken on board the feedback from community members (we acknowledge the reviewers here), a number of us involved in this RFC (@YuchenJin, @jwfromm, @tqchen, @areusch, @mbaret, @jroesch, @tmoreau89) feel it’s necessary to be explicit about the scope of this proposal, and we apologize to those reviewing that this was not present in the original text.
   
   - Acceptance of this RFC doesn't mean there is an agreement to eventually deprecate Relay and replace it with Relax. It only permits bringing the development that's currently occurring on the Relax fork into the TVM repo. This will improve the accessibility of that important work for community stakeholders who rely on it, as well as bring Relax under TVM project governance.
   
   - If at a later stage it's found that individual features from Relax are desired in the Relay compiler (e.g. dynamic shapes, TVMScript support), design discussions and RFCs must take place to determine the best way to implement those features. Acceptance of this RFC gives no preference to Relax as the solution, and so evolving Relay would remain firmly on the table in those discussions.
   
   The RFC has been accordingly amended to include the above commitments, which we hope addresses some of the valid concerns expressed so far.
   
   cc: @leandron @ekalda @Mousius


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1336475126

   After seeing so many voices in this thread, I think it is important to provide a reply here. 
   
   I am wearing the Apache TVM hat as an ASF member and Apache TVM PMC member.
   
   First of all, I would like to say thank you, everyone, for sharing your voices here. This post has received support from more than eight organizations from both industry and academic backgrounds. Your voices are very important to the community and will not be ignored. As many said, we would love the TVM community to continue being inclusive and innovative while maintaining the stability of existing developed components. 
   
   I also would like to come out and acknowledge the positions so far:
   
   The position that @leandron  made so far was: 
   - We do not like to be in a state where relax and relay coexist without deciding the commitment of one replacing another.
   - As a result, due diligence of such a replacement is mandatory before merging the proposal.
   
   I would like explicitly to acknowledge that the above positions have valid rationales, are completely valid, and can be a possible way of software development.
   
   I think the position raised by @YuchenJin  and others were:
   
   - Relax could have the potential to replace Relay, but the proposal as written only proposes to have the two modules coexist.
   - This is just like how most OSS projects bring in modules and evolve things (e.g. TorchFX was brought in even though it overlaps with TorchScript, with no plan to immediately phase out TorchScript). The modules can coexist and evolve, and we can continue conversations about future co-evolution.
   - Having Relax and Relay coexist in the codebase is already a positive step that we should take, especially considering community empowerment.
   
   These are also valid rationales and can be possible ways of developing things. 
   
   As a first step, I would like to acknowledge each others’ positions as they are valid rationales. The main difference is that there is a disagreement on how we should do things as a community. 
   
   Such a decision should be made collectively as a community, considering all the factors involved, including code and community factors. We all make our suggestions taking innovation, stability, and community into account.
   
   When evaluating a proposal and empowering our community members, we expect every one of us to continue having a constructive conversation, considering the latest context.
   
   While the initial comment made by @leandron is valid on its own, I would love to see us re-evaluate our positions considering all the factors in the latest context, including community empowerment and the collective views of other members. I want to say that by no means do we simply seek to dismiss the original position -- I apologize if it came across that way. Instead, we want to acknowledge each view; we have disagreements on the how, while taking the community into consideration.
   
   I think we should continue to have constructive conversations in service of the many who have voiced their support here.
   
   Thank you!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949630757


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+Many low-level library functions (e.g. in TensorRT) are designed in DPS. To be able to call into a DPS packed function, so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        # iterate over the output space (m, k) and reduce over the shared dimension n
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It is `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land, where types depend on values (shapes in our case), which requires heavier machinery to handle.
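+
+As a toy illustration of this separation (plain Python, not the actual Relax type checker), two tensors with different shapes still have equal types as long as their rank and dtype agree, so type checking never needs to reason about symbolic shape values:
+
+```python
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class ToyDynTensorType:  # models DynTensorType: only rank and dtype
+    ndim: int
+    dtype: str
+
+t_a = ToyDynTensorType(ndim=2, dtype="float32")  # e.g. a tensor of shape (n, 4)
+t_b = ToyDynTensorType(ndim=2, dtype="float32")  # e.g. a tensor of shape (8, 4)
+assert t_a == t_b  # equal at the type level; shapes live in the separate shape_ field
+```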
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
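+
+The sketch below (plain Python, not the actual VM implementation) models what matching a Tensor's runtime shape against a pattern means: a symbolic variable is populated on its first occurrence in the scope and checked for consistency afterwards:
+
+```python
+def run_match_shape(runtime_shape, pattern, symbol_env):
+    """Toy model of match_shape on a Tensor: bind new symbols, check known ones."""
+    assert len(runtime_shape) == len(pattern), "rank mismatch"
+    for dim, sym in zip(runtime_shape, pattern):
+        if isinstance(sym, int):
+            assert dim == sym, "static dimension mismatch"
+        elif sym not in symbol_env:
+            symbol_env[sym] = dim          # first occurrence: populate the symbol
+        else:
+            assert symbol_env[sym] == dim  # later occurrence: must stay consistent
+    return runtime_shape
+
+env = {}
+run_match_shape((5,), ["m"], env)          # binds m = 5
+run_match_shape((5, 10), ["m", "k"], env)  # m is already 5; binds k = 10
+```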
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation for which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back to the shape-aware land using `match_shape`. Skipping shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
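+
+A minimal sketch of this safety-net path, reusing the registration style shown earlier in this RFC (the PackedFunc name `"my_unique"` and the use of `numpy.unique` are illustrative stand-ins for a real kernel or `torch.unique`):
+
+```python
+import numpy as np
+import tvm
+
+@tvm.register_func("my_unique")
+def unique_packed(x):
+    # Allocates and returns its own output; the result shape is only known here.
+    return tvm.nd.array(np.unique(x.numpy()))
+```
+
+A Relax program could then call it via `y = R.call_packed("my_unique", x)` and re-enter the shape-aware land with `R.match_shape(y, (m,))`.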
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, without changing other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example, an in-place weight update during model training) and control flow. The program below contains two dataflow blocks.
+

Review Comment:
   Yes, `Ref`s and a few operators like `dropout`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949618073


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.

Review Comment:
   I guess, in principle, it might be feasible to add the possibility of symbolic shapes in Relay with the provision that they be checked dynamically, but it would be very challenging to figure out how to have it coexist with type relations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] Hzfengsy commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Hzfengsy commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950165733


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.

Review Comment:
   `call_tir` is not only designed for dynamic shape support. It also enables optimizations/transformations of an IRModule across both GraphIR and TensorIR. We do support having Relay and TIR in the same IRModule, but we cannot optimize them together.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950339476


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.

Review Comment:
   Thanks for the comment! Yes, this has been supported in Relay, but there are some nontrivial limitations, imo.
   
   - (1) The Relay main pipeline lowers every Relay IR into TIR at once at the IR boundary. This makes partial lowering (lowering only part of the graph) difficult in the main pipeline.
   - (2) The Relay main pipeline supports lowering with `OpStrategy`. However, it is not necessarily easy to customize it (custom lowering).
   
   For these reasons, people introduced `RelayToTIR` and `RelayToRuntime`, which essentially bypass the main pipeline. Although they enable the functionality people want, IMHO they might not be easy to maintain as a framework, and it is not easy to leverage multiple lowering strategies in an incremental way. Therefore, Relax wants to tackle this problem and provide such support in an organized, systematic way. For example, since Relax provides a unified abstraction, we can introduce a GraphIR->TIR transformation into the pipeline, and this is essentially what lowering does. Thus, by introducing such a mechanism as a Relax->TIR transformation pass, Relax can bring those functionalities into the main pipeline in a customizable manner. You can also easily conduct partial lowering within a pass since you have full control. We expect users will be able to reuse most of the lowering machinery since, most of the time, you just want to change the "how to lower" part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1715756599

   Sending another reminder for everyone to chime in on the related unity discussion threads at https://discuss.tvm.apache.org/c/development/unity/14. We would love to see your participation in all the technical discussions and to see how we can collectively address your needs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r957864999


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, the shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
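+
+For intuition, the snippet below (illustrative only, not compiler output) shows how a few annotations used in this RFC erase to `DynTensorType` instances; the fully-unknown case `R.Tensor[_, _]` would carry `ndim=-1` and `dtype=DataType::Void()`.
+
+```python
+DynTensorType(ndim=2, dtype="float32")   # R.Tensor[(n, 4), "float32"]: rank and dtype known
+DynTensorType(ndim=1, dtype="float32")   # R.Tensor[(n * 4,), "float32"]: rank and dtype known
+DynTensorType(ndim=-1, dtype="float32")  # R.Tensor[_, "float32"]: rank unknown, dtype known
+```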
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where types depend on values (shapes in our case), which requires heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back to the shape-aware land using `match_shape`. Operating without shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
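+
+To make the fallback path concrete, the snippet below sketches how such an opaque kernel could be registered as a PackedFunc; the function name `relax.fallback.unique` and the NumPy-based implementation are illustrative assumptions rather than part of this proposal.
+
+```python
+import numpy as np
+import tvm
+
+# Hypothetical fallback kernel for R.unique. It allocates and returns its own
+# output (the output size is data dependent), so it is invoked as a non-DPS
+# packed function instead of being lowered to destination-passing style.
+@tvm.register_func("relax.fallback.unique")
+def unique_packed(x: tvm.nd.NDArray) -> tvm.nd.NDArray:
+    result = np.unique(x.numpy())
+    return tvm.nd.array(result)
+```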
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flow (control flow is an advanced construct that most pass writers do not need to consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight updates during model training) and control flow. The program below is an example that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (as output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level too. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, while a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope such as an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each block in its list of `BindingBlock`s is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap; see the sketch below. Please refer to the shape lowering pass in the next subsection for more details.
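+
+A rough sketch of that lowered call sequence, in pseudo-Python, is shown below; the builtin names and calling convention here are illustrative assumptions rather than the final interface.
+
+```python
+# Illustrative pseudo-Python of the lowered call sequence (not actual VM bytecode).
+shape_heap = vm_builtin_alloc_shape_heap(4)        # 4 slots: m, n, j, k
+vm_builtin_store_shape(A.shape, shape_heap, 0, 1)  # heap[0] = m, heap[1] = n
+shape_func(shape_heap)                             # TIR PrimFunc generated by the shape lowering pass:
+                                                   #   heap[2] = heap[0] + 1; heap[3] = heap[1] + 1
+j_k = vm_builtin_load_shape(shape_heap, 2, 3)      # construct the runtime shape (j, k)
+```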
+
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow users to explore new design spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        for i, j in T.grid(n1, m1):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    relax.runtime.builtin.free_tensor(tensor0)
+    relax.runtime.builtin.free_storage(storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as TIR primfunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh) 
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    relax.runtime.builtin.free_tensor(gv0)
+    relax.runtime.builtin.free_storage(storage0)
+    return gv1
+```
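+
+For reference, a generated shape function such as `product_shape` might look roughly like the sketch below; this is an illustrative guess at the generated TIR (written here to operate directly on the shape heap), not the actual codegen output.
+
+```python
+from tvm.script import tir as T
+
+@T.prim_func
+def product_shape(H: T.Buffer[(4,), "int64"]) -> None:
+    # H is the shape heap; H[0] = n and H[1] = m were populated by store_shape.
+    # The product n * m is written into another slot, where later loads can read it.
+    H[2] = H[0] * H[1]
+```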
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support of directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering. 
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also has symbolic shape as a first-class citizen (D1 in Section 3), it can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The code in the diagram above demonstrates how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule, which is shown as TVMScript on the right.
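+
+Since the diagram above is rendered as an image, the left-hand side is sketched below in code form; the exact `relax.Var` and `BlockBuilder` signatures used here are assumptions based on the description in this section and may differ from the final API.
+
+```python
+import tvm
+from tvm import relax, topi
+
+def build_matmul_module(n: int, k: int, m: int) -> tvm.IRModule:
+    bb = relax.BlockBuilder()
+    x = relax.Var("x", [n, k], relax.DynTensorType(2, "float32"))
+    w = relax.Var("w", [k, m], relax.DynTensorType(2, "float32"))
+    with bb.function("main", [x, w]):
+        # emit_te creates te.placeholders for x and w, calls topi.matmul on them,
+        # converts the resulting TE computation into a TIR PrimFunc in the IRModule,
+        # and emits a call_tir to that PrimFunc.
+        gv = bb.emit_te(topi.matmul, x, w)
+        bb.emit_func_output(gv)
+    return bb.get()
+```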
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Calls into the TE/TOPI function (`topi.matmul` in this case) with those `te.placeholder`s.

Review Comment:
   Thanks for the catch! It's updated.
   





[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r957865250


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
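+
+As a concrete illustration of this convention, here is a toy DPS-style function written with plain NumPy (used only to keep the sketch self-contained; it is not part of the proposed API):
+
+```python
+import numpy as np
+
+# A toy DPS-style kernel: the caller allocates `out`; the callee only writes into it.
+def add_dps(a: np.ndarray, b: np.ndarray, out: np.ndarray) -> None:
+    np.add(a, b, out=out)
+
+a = np.ones(4, dtype="float32")
+b = np.ones(4, dtype="float32")
+out = np.empty(4, dtype="float32")   # allocated by the caller (e.g. a compiler-managed buffer)
+add_dps(a, b, out)                   # no return value; the result lands in `out`
+```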
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc calls that takes in an NDArray x and perform an unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for exmaple the above `R.unique(x)` can be lowered to `call_packed(”torch.unique”, x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free ****if: it only reads from its inputs and returns the result via its output, it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been `R.output`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most of the optimizations are graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level too. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build a **end-to-end MVP** (Minimum Viable Product) which allows the users to construct an end-to-end model (represented by IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: Each binding block in its `BindingBlock` is evaluated, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example, `relax_func` contains a `SeqExpr`, the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For the ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similiar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap(which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR Primfunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.
+
+As future plan, we will consolidate Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow the users to explore new space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be capable to optimize the left part.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
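+
+As a rough illustration of the kind of decision such a planner makes (a hypothetical sketch in plain Python, not the design to be proposed in that RFC), storage can be reused whenever a previously allocated block of sufficient size is no longer live:
+
+```python
+# Hypothetical greedy storage-reuse sketch; not TVM code.
+from typing import List, Tuple
+
+def plan_storage(allocs: List[Tuple[str, int, Tuple[int, int]]]) -> List[list]:
+    """Each alloc is (name, size_in_bytes, (first_use_step, last_use_step)).
+    Returns a list of storage slots: [size, last_use_step, assigned_names]."""
+    pool: List[list] = []
+    for name, size, (start, end) in sorted(allocs, key=lambda a: a[2][0]):
+        for slot in pool:
+            # Reuse a slot only if it is big enough and its last user is dead.
+            if slot[0] >= size and slot[1] < start:
+                slot[1] = end
+                slot[2].append(name)
+                break
+        else:
+            pool.append([size, end, [name]])
+    return pool
+
+# tensor0 is still live when tensor1 is created (it is an input to "flatten"),
+# so two storages are needed; if tensor0 died earlier, storage0 would be reused.
+print(plan_storage([("storage0", 64, (0, 2)), ("storage1", 64, (2, 3))]))  # 2 slots
+print(plan_storage([("storage0", 64, (0, 1)), ("storage1", 64, (2, 3))]))  # 1 slot
+```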
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation

Review Comment:
   Thanks for the suggestion, it makes sense!
   
   I updated the "shape heap" to "shape tensor" in the RFC, and mentioned that `alloc_shape_tensor` can be merged with `alloc_tensor` to achieve the same goal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950339476


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.

Review Comment:
   Thanks for the comment! Yes, this has been supported in Relay, but there are some nontrivial limitations, in my opinion.
   
   - (1) The Relay main pipeline lowers every Relay IR into TIR at once at the IR boundary. This makes partial lowering (lowering only part of the graph) difficult in the main pipeline.
   - (2) The Relay main pipeline supports lowering with `OpStrategy`. However, it is not necessarily easy to customize it (custom lowering).
   
   For these reasons, people introduced `RelayToTIR` and `RelayToRuntime`, which essentially bypass the main pipeline. Although they enable the functionality people want, they are hard to maintain as a framework, and it is not easy to leverage multiple lowering strategies in an incremental way. Therefore, Relax wants to tackle this problem and provide such support in an organized, systematic way. For example, by introducing this mechanism as a Relax->TIR transformation pass, Relax can bring those functionalities into the main pipeline in a customizable manner (a rough sketch follows below). We expect users will be able to reuse most of the lowering machinery, since most of the time you just want to change the "how to lower" part.
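   
   As an illustration of what "partial lowering as a pass" could look like, here is a minimal Python sketch (all classes and helpers below are hypothetical, not actual Relay/Relax APIs): a module-to-module transformation that lowers only the calls selected by a user-supplied predicate and leaves the rest untouched for later passes.
   
   ```python
   # Hypothetical sketch of partial lowering as a module->module transformation.
   from dataclasses import dataclass, replace
   from typing import Callable, List
   
   @dataclass(frozen=True)
   class CallNode:
       op: str              # e.g. "nn.conv2d"
       lowered: bool = False
   
   @dataclass(frozen=True)
   class Module:
       calls: List[CallNode]
   
   def partial_lower(mod: Module,
                     should_lower: Callable[[CallNode], bool],
                     lower_fn: Callable[[CallNode], CallNode]) -> Module:
       # Lower only the selected calls; everything else stays as-is so a later
       # pass (or a different lowering strategy) can handle it.
       return Module([lower_fn(c) if should_lower(c) else c for c in mod.calls])
   
   mod = Module([CallNode("nn.conv2d"), CallNode("nn.softmax"), CallNode("nn.dense")])
   # First pass: hand conv2d to a custom lowering strategy, keep the rest.
   mod = partial_lower(mod,
                       should_lower=lambda c: c.op == "nn.conv2d",
                       lower_fn=lambda c: replace(c, lowered=True))
   print([(c.op, c.lowered) for c in mod.calls])
   ```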



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] Hzfengsy commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Hzfengsy commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1220829699

   Thanks @leandron and @ekalda for the comments. We all agree that we are trying to improve the graph-level IR of TVM, while the controversial point is whether we can enhance Relay to support the features from Relax. Let's discuss it directly and focus on the technical points themselves.
   
   First of all, I'd like to list some of the most critical features that Relax wants to introduce:
   
   1. Dynamic shape support, specifically symbolic shape representation;
   2. A representation for TVMUnity, i.e. a cross-layer abstraction for optimization;
   3. Customizable compilation flow and operator support. 
   
   In my opinion, it's hard to incrementally update Relay to support them.
   
   ## G1: Dynamic shape support
   
   To be specific, Relax can represent and *deduce* symbolic shapes rather than using `Any`. However, if we introduce symbolic shapes into Relay, there will be two competing representations for shapes (symbolic shapes and `Any`), which is undesirable. A rough sketch of the difference is shown below.
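   
   The sketch below (plain Python, hypothetical helper code rather than Relay/Relax internals) shows the difference in what can be deduced for a `flatten` over a 2-D tensor:
   
   ```python
   # Illustrative contrast between `Any` dims and symbolic dims.
   from typing import Optional, Tuple, Union
   
   Dim = Union[int, str]   # a dimension: a constant or a named symbol like "n"
   ANY = None              # `Any`-style unknown dimension: all information is lost
   
   def flatten_shape_symbolic(shape: Tuple[Dim, Dim]) -> Tuple[str]:
       # With symbolic dims we can still deduce the output shape of flatten:
       # (n, m) -> (n * m,), keeping the relation between input and output.
       n, m = shape
       return (f"({n}) * ({m})",)
   
   def flatten_shape_any(shape: Tuple[Optional[int], Optional[int]]):
       # With `Any`, the only thing we can say about the output is "unknown".
       return (ANY,)
   
   print(flatten_shape_symbolic(("n", "m")))   # ('(n) * (m)',)
   print(flatten_shape_any((ANY, ANY)))        # (None,)
   ```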
   
   ## G2: A representation for TVMUnity
   
   TVMUnity is an important feature for unified optimization across the graph level, tensor computations, and libraries. The build flow of Relay is a one-way path: `relay->tir/libraries->runtime module`, while TVMUnity enables `IRModule(graph+tir+libraries)->IRModule` transformations, which gives users more flexibility to choose the backend (use codegen or call libraries) even after tuning. I'm not sure whether that is possible for Relay if we still keep the original workflow.
   
   ## G3: Customizable compilation flow and operator support
   Customizing operators and backends is really common in production. There are [7 steps](https://tvm.apache.org/docs/dev/how_to/relay_add_op.html) to add a new operator to Relay. However, we only need 2 steps in Relax (see the sketch after this list):
   1. write how the op is computed (either TIR or a library call works),
   2. use `call_tir` to represent it in the IRModule
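   
   As a sketch of those two steps, reusing the prototype TVMScript notation from the RFC itself (illustrative only; it is not guaranteed to run against current mainline TVM, and the exact syntax may differ):
   
   ```python
   import tvm
   from tvm.script import tir as T, relax as R
   
   @tvm.script.ir_module
   class AddOpModule:
       # Step 1: write how the op is computed, here as a TIR PrimFunc in DPS
       # (z is the output buffer).
       @T.prim_func
       def tir_add(x: T.handle, y: T.handle, z: T.handle):
           n = T.var("int32")
           X = T.match_buffer(x, (n,), "float32")
           Y = T.match_buffer(y, (n,), "float32")
           Z = T.match_buffer(z, (n,), "float32")
           with T.grid(n) as i:
               Z[i] = X[i] + Y[i]
   
       # Step 2: represent the op in the graph by calling it with call_tir;
       # no operator registration or strategy definition is needed.
       @R.function
       def main(x: R.Tensor[(n,), "float32"], y: R.Tensor[(n,), "float32"]):
           with R.dataflow():
               gv0 = R.call_tir(tir_add, (x, y), (n,), dtype="float32")
               R.outputs(gv0)
           return gv0
   ```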
   
   Additionally, other compilation customizations (e.g. BYOC, AOT, customized fusion) are also more straightforward with Relax. Please see the [TVM Unity Connection](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#byoc-9).
   
   In short, a new IR is a reasonable way to support the above features, in my opinion, and I'm open to hearing more ideas from the community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949618073


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.

Review Comment:
   I guess, in principle, it might be possible to add symbolic shapes to Relay with the provision that they be checked dynamically, but it would be very challenging to figure out how to have them coexist with type relations.
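   
   For illustration, "checked dynamically" could amount to something like the following sketch (plain Python, purely hypothetical, not Relay or Relax code): symbols are bound on first use and later occurrences are verified against the runtime shape.
   
   ```python
   # Hypothetical sketch of deferring symbolic shape constraints to runtime.
   def match_shape(tensor_shape, pattern, bindings):
       # pattern holds ints or symbol names; bind unbound symbols, and check
       # already-bound symbols and constants against the runtime shape.
       for dim, pat in zip(tensor_shape, pattern):
           if isinstance(pat, str):
               if pat in bindings and bindings[pat] != dim:
                   raise ValueError(f"shape mismatch: {pat}={bindings[pat]} vs {dim}")
               bindings[pat] = dim
           elif pat != dim:
               raise ValueError(f"shape mismatch: expected {pat}, got {dim}")
       return bindings
   
   env = {}
   match_shape((4, 8), ("n", "k"), env)   # binds n=4, k=8
   match_shape((8, 3), ("k", "m"), env)   # checks k == 8, binds m=3
   print(env)                             # {'n': 4, 'k': 8, 'm': 3}
   ```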



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949608390


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+

Review Comment:
   I think there is a significant technical reason why such a feature would be difficult to add to Relay, which is that any call directly into a TIR function from Relay would need a type relation that describes the tensor shape requirements. Relax adds more capabilities for checking tensor shapes dynamically and is much more flexible in this regard than Relay. That said, I think "is making a new IR better than trying to add these features into the existing one?" is a good question to be asking.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] Mousius commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Mousius commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952413924


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.

Review Comment:
   > Thanks for the comment! Yes, it has been supported in Relay but there are some nontrivial limitation, imo.
   > 
   > * (1) Relay main pipeline lowers every Relay IRs into TIR at once at their IR boundary. This makes partial lowering (lower only part of the graph) difficult in the main pipeline.
   > * (2) Relay main pipeline supports lowering with `OpStrategy`. However, it is not necessarily easy to customize it (custom lowering)
   > 
   > For these reasons, people introduced `RelayToTIR` and `RelayToRuntime` that essentially bypass the main pipeline. Although it enables the functionalities people want, IMHO, it might not be easy to maintain them as a framework and it is not easy if you want to leverage multiple lowering strategies in the incremental way. Therefore, Relax wants to tackle down this problem and provide such supports in an organized systematic way. For example, since Relax provides unified abstraction, we can introduce GraphIR->TIR transformation into the pipeline and this is essentially what lowering does. Thus, by introducing such mechanism as a Relax->TIR transformation pass, Relax can bring those functionalities into the main pipeline in a customizable manner. You can also easily conduct partial lowering within a pass since you have a full control. We expect users may be able to reuse most of lowering machinery since most of times, you just want to change "how to lower" part.
   
   I've put a reply to this above, but I'll quickly summarise in response here. `RelayToTIR` works together with the main pipeline to allow partial lowering of a `Target`; it runs before the main `TECompiler` and therefore seems to be exactly what you're describing as Relax->TIR? We leave the module in a partially-TIR/partially-Relay state for the pipeline to continue lowering.
   
   I do agree that `RelayToRuntime` essentially bypassing the entire compiler is not particularly helpful for re-use; this was a motivation for creating `RelayToTIR` in the first place.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.

Review Comment:
   How is this different to checking if a `Call` is to a `tir::PrimFunc` ? 



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+

Review Comment:
   > In Relay, there has been no convenient way to optimize graph IR by using the feedback from low-level.
   > * If TensorIR performs layout transformation for a primfunc, its decision will affect other primfuncs as well. However, Relay cannot provide such feedback back to graph IR-level since two different IRs cannot co-exist.
   > * Graph-level tuning methods (e.g., TASO, Collage) need a capability to apply a set of passes to the part of the graph, compile/measure its performance, and provide the performance number as a feedback back to Graph-IR level to generate better candidates. Although this could be achieved by nontrivial engineering efforts, it would complicate the compilation pipeline and maintenance efforts. IMHO, joint-optimization across multiple graph tuners (e.g., TASO+Collage) would be practically impossible.
   
   It's possible for an IRModule to contain both `tir::PrimFunc` and `relay::Function` at the same time which would allow the replacement of any given `Call` operator inside of Relay with a lower level operator to trial as part of a scheduler - this seems to indicate it's entirely possible using an `IRModule` to connect higher and lower level IR unless I'm missing something?
   
   We've previously done this in CMSIS-NN, `relay::Function`s become `tir::PrimFunc`s as part of our custom lowering in `RelayToTIR` which results in an `IRModule` with some functions lowered to TIR and some remaining in Relay. 
   
   > Lowering should be done at once at the boundary between Relay and TensorIR and customizing lowering has been very challenging (e.g., partial/custom lowering).
   > * The main pipeline with `OpStrategy` has not been easy to customize and lower part of the graph for your own target, such as BYOC, while keeping other parts still in high-level IR. Therefore, people had to figure out the way to bypass it and apply their own lowering mechanism (e.g., `RelayToTIR`) that bypasses the main pipeline.
   > * If you only want to apply certain schedule rules on the part of the Graph IR, you would need to lower those parts and apply schedule rules to them. However, such freedom has not been allowed for Relay main pipeline, so people had to find out workaround (e.g., use task extraction and find the primfunc among them. However, if extraction does not behave as users wanted, it would require extra engineering efforts).
   
   `RelayToTIR` allows a `Target` to customise how it decides to lower. Every `Target` can register a `RelayToTIR` hook which uses a set of compose-able `Pass`es to create a per-`Target` lowering pipeline; these `Target`s would compose together to form the main pipeline. This indicates we have the capability to create more complex per-`Target` lowerings of not just operators, but also groups of operators using `RelayToTIR` already with the flexibility to improve any logic for a specific `Target` or re-use existing `Pass`es where necessary?
   
   The limitation I can see with this is that trying different operator strategies within the per-`Target` lowering isn't well supported? Could this not be supported by providing search information to `RelayToTIR`? 🤔 



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   Could you clarify what you mean when you say that Relax maximises expressibility as opposed to Relay? My assumption is that this is the TVMScript addition but that could equally have been done for Relay 🤔 



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
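+
+For illustration, here is a minimal NumPy-based sketch of the caller side of a DPS call (an analogy only, not a Relax or TVM API): the caller allocates the output buffer, and the kernel only writes into it.
+
+```python
+import numpy as np
+
+def dps_add(a, b, out):
+    # DPS kernel: writes the result into the caller-allocated `out` buffer
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocation is owned by the caller (e.g. the compiler)
+dps_add(a, b, out)
+```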
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+Many low-level library functions (e.g., in TensorRT) are designed in DPS, which lets the compiler directly manage the output memory. To call into such a DPS packed function, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[k, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique`, which selects the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter the realm of [dependent typing](https://en.wikipedia.org/wiki/Dependent_type), where types depend on values (shapes in our case) and therefore require heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware world back into the shape-aware world using match_shape. Opting out of shape computation is by no means the most effective way to handle things, but it is necessary for data-dependent calculations and for interfacing with external libraries that have weaker shape information.
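+
+As a sketch of how such a dispatch could be wired up, a PackedFunc wrapping `torch.unique` can be registered through the TVM FFI as below; the registration name "torch.unique" is only an illustrative assumption, not something shipped by this RFC.
+
+```python
+import torch
+import tvm
+
+@tvm.register_func("torch.unique")
+def torch_unique(x):
+    # x is a tvm.nd.NDArray whose shape may be unknown at compile time;
+    # the callee allocates and returns a fresh NDArray (non-DPS style).
+    result = torch.unique(torch.from_numpy(x.numpy()))
+    return tvm.nd.array(result.numpy())
+```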
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) that allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes the shape of a tensor cannot be deduced statically, either
+    because the shape is truly data dependent (such as the output of the
+    `unique` operator) or because of limited shape inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example, `relax_func` contains a `SeqExpr`, the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. A NumPy sketch of this mechanism is shown below; please also refer to the shape lowering pass in the next subsection for more details.
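+
+A plain-NumPy sketch of the shape-heap idea (illustrative only; in the real runtime the heap is a TVM NDArray and the arithmetic is performed by a generated TIR PrimFunc):
+
+```python
+import numpy as np
+
+# shape heap: a flat integer buffer shared by all shape computations
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(shape, heap, *indices):
+    # store each dimension of `shape` into the given heap slots
+    for dim, idx in zip(shape, indices):
+        heap[idx] = dim
+
+# A's runtime shape (m, n) = (5, 7) goes into slots 0 and 1
+store_shape((5, 7), shape_heap, 0, 1)
+
+# the generated shape function computes (j, k) = (m + 1, n + 1) into slots 2 and 3
+shape_heap[2] = shape_heap[0] + 1
+shape_heap[3] = shape_heap[1] + 1
+
+j, k = shape_heap[2], shape_heap[3]  # j == 6, k == 8
+```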
+
+As future plan, we will consolidate Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow the users to explore new space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("int32"), T.var("int32")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        for i, j in T.grid(n1, m1):
+            with T.block():
+                Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+		
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as TIR primfunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh) 
+		
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp"), x, gv0)
+		
+    sh1 = R.call_packed("load_shape"), heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten"), gv0, gv1)
+		
+    R.call_packed("free_tensor"), gv0)
+    R.call_packed("free_storage"), storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support of directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering. 
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+ 
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also has symbolic shape as first class (D1 in Section 3), Relax can directly integrate with TE and TOPI library.
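+
+As a small sketch of this bridge, the TE computation above can be turned into a TIR PrimFunc with `te.create_prim_func` (the same utility `emit_te` relies on, as described below), and the symbolic dimension `n` is preserved:
+
+```python
+import tvm
+from tvm import te
+
+n = te.var(name="n")
+A = te.placeholder((n,), name="a")
+B = te.placeholder((n,), name="b")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="c")
+
+# convert the TE computation into a TIR PrimFunc; the symbolic dimension n
+# is kept, so the resulting PrimFunc can later be invoked via call_tir
+prim_func = te.create_prim_func([A, B, C])
+print(prim_func.script())
+```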
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE.  The left side of the above diagram uses `relax.BlockBuilder` API to incrementally build the IRModule as shown in TVMScript on the right. 
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Schedules the TE/TOPI function (`topi.matmul` in this case) using those `te.placeholder`.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean given that Relax has symbolic shape as first class and the support for `call_tir` for cross-layer interactions.
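+
+A rough sketch of the BlockBuilder flow on the left-hand side of the diagram might look like the following; the exact constructor and method signatures (`relax.Var`, `relax.DynTensorType`, `bb.emit_te`, `bb.emit_output`, `bb.emit_func_output`) are assumptions based on the description above rather than a finalized API.
+
+```python
+import tvm
+from tvm import relax, topi
+
+def build_matmul_module():
+    bb = relax.BlockBuilder()
+    # shapes are given as annotations here; exact argument forms are assumptions
+    x = relax.Var("x", [128, 256], relax.DynTensorType(2, "float32"))
+    w = relax.Var("w", [256, 512], relax.DynTensorType(2, "float32"))
+    with bb.function("main", [x, w]):
+        with bb.dataflow():
+            # emit_te creates te.placeholders for x and w, applies topi.matmul,
+            # converts the result to a TIR PrimFunc, and emits a call_tir to it
+            lv0 = bb.emit_te(topi.matmul, x, w)
+            gv0 = bb.emit_output(lv0)
+        bb.emit_func_output(gv0)
+    return bb.get()
+```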
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function containing a sequence of `call_tir` nodes that call into these generated TIR PrimFuncs.
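+
+For example, translating an existing Relay module might look roughly like the snippet below; the `relay_translator` import path and the `from_relay` entry point are assumptions about the translator's interface rather than a settled API.
+
+```python
+import tvm
+from tvm import relay
+from tvm.relax.testing import relay_translator  # assumed location of the translator
+
+# build a tiny Relay module to translate
+x = relay.var("x", shape=(1, 784), dtype="float32")
+w = relay.var("w", shape=(128, 784), dtype="float32")
+relay_mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))
+
+# Relay ops are lowered through their TOPI strategies and emit_te, producing
+# TIR PrimFuncs plus a Relax "main" function built from call_tir nodes
+relax_mod = relay_translator.from_relay(relay_mod["main"])
+```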
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR

Review Comment:
   `Relax IR` really needs an independent RFC which details the dynamic type system so we can discuss methods of integrating it directly into Relay, this seems to be one of the larger technical hurdles within this RFC (though I agree with @ekalda that really we need RFCs for each independent architectural piece here)



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
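+
+To make the convention concrete, here is a minimal NumPy-based sketch of DPS (purely illustrative; `add_dps` is a hypothetical function, not part of any library): the caller allocates the output buffer and the callee only writes into it.
+
+```python
+import numpy as np
+
+def add_dps(a, b, out):
+    # Destination-passing style: write the result into the caller-provided buffer.
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocated by the caller (e.g. the compiler/framework)
+add_dps(a, b, out)
+```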
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way), so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.output(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program from Relay to TIR in one shot as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`):
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the three possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land, where types depend on values (shapes in our case), which would require heavier machinery to handle.
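+
+As a small illustration of this separation (assuming the Python binding of the prototype's `DynTensorType` shown in Section 4.1; the exact constructor is an assumption):
+
+```python
+import tvm
+from tvm import relax
+
+# Tensors annotated as (n, 4) and (m, 8) still share the same *type*:
+# only rank and dtype are part of DynTensorType; the shape lives in shape_.
+t1 = relax.DynTensorType(ndim=2, dtype="float32")
+t2 = relax.DynTensorType(ndim=2, dtype="float32")
+assert tvm.ir.structural_equal(t1, t2)
+```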
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handles, cuDNN contexts).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back to the shape-aware land using match_shape. Running without shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfacing with external libraries that have weaker shape information.
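+
+As a hedged sketch (not the exact compiler output; `some_tir_func` is a hypothetical PrimFunc), the lowered form of the `unique` example could look as follows, with `match_shape` bringing the result back into the shape-aware land:
+
+```python
+@R.function
+def lowered_unique_example(x: R.Tensor[(n,), "float32"]):
+    # The data-dependent call stays opaque: the callee allocates and returns
+    # its own result instead of using destination passing.
+    y = R.call_packed("torch.unique", x)
+    # match_shape binds m at runtime and re-attaches shape information to y.
+    y1: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+    # From here on, m participates in shape computation as usual.
+    return R.call_tir(some_tir_func, (y1,), (m,), dtype="float32")
+```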
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free **if** it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most of the optimizations to be graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could be converted to the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.output(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via manipulation of a shape heap (an internal NDArray); see the sketch after this list.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of the shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.
+
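+The following NumPy-based sketch only emulates the shape-heap idea from the bullet above; it is not the Relax VM implementation, and `store_shape`/`shape_func` here are stand-ins for the VM builtin and the generated TIR PrimFunc.
+
+```python
+import numpy as np
+
+shape_heap = np.empty(4, dtype="int64")  # 4 slots: m, n, j, k
+
+def store_shape(shape, heap, *indices):
+    # stand-in for the VM builtin that copies a runtime shape into the heap
+    for idx, dim in zip(indices, shape):
+        heap[idx] = dim
+
+def shape_func(heap):
+    # stand-in for the TIR PrimFunc generated by the shape lowering pass
+    heap[2] = heap[0] + 1  # j = m + 1
+    heap[3] = heap[1] + 1  # k = n + 1
+
+A = np.zeros((5, 7), dtype="float32")    # at runtime, m = 5 and n = 7
+store_shape(A.shape, shape_heap, 0, 1)
+shape_func(shape_heap)
+j, k = int(shape_heap[2]), int(shape_heap[3])  # (6, 8)
+```
+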
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts (see the pipeline sketch after this list).
+
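+As a sketch only (the four pass names below are taken from the Relax prototype and correspond to the “Minimum build (4 passes)” item in Section 4.5; they are assumptions and may change during upstreaming), such a composable pipeline could look like:
+
+```python
+import tvm
+from tvm import relax
+
+# Each element is an IRModule -> IRModule pass; users can insert or remove
+# passes (e.g. a BYOC partitioning pass) without touching the compiler core.
+seq = tvm.transform.Sequential(
+    [
+        relax.transform.ToNonDataflow(),   # erase dataflow block boundaries
+        relax.transform.CallTIRRewrite(),  # lower call_tir to explicit allocation (C0 below)
+        relax.transform.VMMemoryLower(),   # lower memory builtins to VM builtins
+        relax.transform.VMShapeLower(),    # lower shape computation to heap manipulation (C1 below)
+    ]
+)
+lowered_mod = seq(MyIRModule)  # e.g. the IRModule defined in the running example below
+```
+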
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.grid(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.output(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### Address C0: lower `call_tir` to explicit memory allocation form
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effects
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it now resides in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n * m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n * m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m * n], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### Address C1: do shape lowering via VM heap manipulation
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k)
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to construct TIR functions manually through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+from tvm import te
+
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
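+
+As a small follow-up sketch (assuming the tensors `A`, `B`, `C` from the snippet above), such a symbolic TE computation can be converted into a TIR PrimFunc with `te.create_prim_func`, which is the same mechanism that `emit_te` (described below) builds on:
+
+```python
+# Convert the TE compute graph into a TIR PrimFunc; n stays symbolic.
+prim_func = te.create_prim_func([A, B, C])
+# The resulting PrimFunc is in destination-passing style; it can be added to an
+# IRModule and invoked from Relax via call_tir.
+```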
+
+Since Relax also has symbolic shape as a first-class citizen (D1 in Section 3), Relax can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the above diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule, which is shown as TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Calls the TE/TOPI function (`topi.matmul` in this case) with those `te.placeholder`s to construct the TE computation.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean, given that Relax has symbolic shape as first class and supports `call_tir` for cross-layer interactions.
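+
+As a minimal sketch of the BlockBuilder flow shown in the diagram above (the exact constructor signatures, in particular how a `relax.Var` is annotated with shape and type, are assumptions based on the Relax prototype and may differ in the upstreamed code):
+
+```python
+import tvm
+from tvm import relax, topi
+
+bb = relax.BlockBuilder()
+n = tvm.tir.Var("n", "int64")
+x = relax.Var("x", [n, 128], relax.DynTensorType(2, "float32"))
+w = relax.Var("w", [128, 64], relax.DynTensorType(2, "float32"))
+
+with bb.function("main", [x, w]):
+    with bb.dataflow():
+        # emit_te creates te.placeholders for x and w, builds the TE graph with
+        # topi.matmul, converts it to a TIR PrimFunc, and emits a call_tir to it.
+        lv0 = bb.emit_te(topi.matmul, x, w)
+        gv0 = bb.emit_output(lv0)
+    bb.emit_func_output(gv0)
+
+mod = bb.get()  # IRModule containing the Relax "main" plus the generated PrimFunc
+mod.show()
+```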
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir`s into these generated TIR PrimFuncs.
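+
+A usage sketch under stated assumptions (the module path `tvm.relax.testing.relay_translator` and its `from_relay` entry point are taken from the Relax prototype and may change during upstreaming):
+
+```python
+from tvm.relay import testing as relay_testing
+from tvm.relax.testing import relay_translator  # module path is an assumption
+
+# Obtain a Relay workload, then translate it into a Relax IRModule whose "main"
+# function is a sequence of call_tir calls into the generated TIR PrimFuncs.
+relay_mod, params = relay_testing.resnet.get_workload(num_layers=18, batch_size=1)
+relax_mod = relay_translator.from_relay(relay_mod["main"])
+```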
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution

Review Comment:
   The upstreaming plan should be represented in the Relax Roadmap (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0069-relax-roadmap.md)? Which should also contain references to these more detailed RFCs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] yzh119 commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
yzh119 commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1271521751

   I'm a graduate researcher at UW and have been working as a full-time SDE at AWS AI for years, mostly around Deep Learning Frameworks Libraries. I feel like all of us agree dynamic shapes are essential so I don't want to spend more time emphasizing how important it is. I'm not a contributor to Relax, but I have been following it for a long time. I don't want to pretend to be neutral, I do think it is quite necessary to welcome Relax, rather than just adding dynamic shape support in Relay.
   
   The main controversy in this thread is about whether to upgrade Relay incrementally or develop a new IR called Relax. I understand hardware companies appreciate stability, and we can see CUDA didn't change its interface drastically over the years, what a miracle! There must be several times people wanted to develop new languages/compilers for NVIDIA GPUs but CUDA survived, this is a lesson we should learn: in the beginning, we design things with a vision of the future in mind, then we maintain them with high standard, improve it incrementally and be customer-obsessed.
   
   This is the ideal story, but we should not ignore that, though CUDA was invented before the DL era, there were already many high-performance computing workloads its designers could refer to. Fortunately, even in 2022, the operators used in DL still highly align with HPC ones and are actually simpler (it's a world of GEMM). What about the story of (computational) graph-level IRs? The dominant workload in DL changes over time, and I would say it causes a lot of headaches for framework and compiler designers: first CNNs/RNNs/LSTMs/Tree-LSTMs (the structural dynamism is one of the challenges Relay would like to tackle, but unfortunately they are used nowhere), then we have Transformers/GNNs (not as hot as Transformers because of the hardware lottery, but who knows the future). Now we have entered a time where models converge, but scalability grows significantly: models become larger and larger, and a lot of engineers and researchers propose techniques (checkpointing and rematerialization, quantization, graph substitution, fusion and stitching, sparsification and mixture-of-experts, hybrid parallelism) to optimize DL workloads at compile time, and I'm glad to see many of them are developed upon TVM because TVM's design is always up-to-date and supports new workloads quickly. However, Relay's current design cannot take full advantage of these new techniques, and the system has a tendency to become fragile. Relax is a great opportunity for us to reconsider the graph-level IR design: prune the redundancies and add new functionalities. It's exciting to see that we can unify different levels of optimizations together in [TVM Unity](https://github.com/apache/tvm-rfcs/pull/91), once Relax is accepted by the community. Refactoring makes things simpler, rather than more complex.
   
   Whenever we find it's time to make some changes, TVM always embraces new designs. This has happened several times in TVM history: prior to Relay, there was NNVM, which was deprecated and completely replaced with Relay. The previous Tensor Expression had limited expressiveness, and the schedule tree data structure could not support tensorization elegantly, so we got TensorIR, which is not only backward compatible, but also brings opportunities for developing new dialects (Ruihang and I designed SparseTIR upon it, and it works pretty well). AutoTVM could not generate scheduling templates automatically, so we got Ansor and MetaSchedule. I would emphasize that **the most important parts of all these updates were upstreamed within several months** and did not break backward compatibility, which credits our hard-working and open-minded contributors and reviewers. Committing to TVM helps these contributors become MLC experts, and some of them are PMC members now. I would say none of these refactorings hurt TVM's reputation; on the contrary, they make people impressed by TVM's speed in adapting to the future, and they are more willing to try TVM because it's open and driven by innovation.
   
   I really don't understand what's different this time, when it comes to Relax. We have a bigger community, which is awesome, and I definitely welcome your input and constructive suggestions on the future of this project. I view the [New Scoped Module RFC](https://discuss.tvm.apache.org/t/process-rfc-empowering-new-scoped-module-to-the-project/13617) as a contract between industrial developers and researchers/engineers like me who work on "toy prototypes": we promise not to touch anything that might influence user experience, and we also don't want to be discouraged because my prototypes cannot be upstreamed and only stay in some random GitHub repo as toys. I also think the new S0-S1-S2 process is already the most painless approach to delivering new designs, and the effect is equivalent to *incremental change*. If people take a look at the Relax repo, it already has a huge amount of code and well-written documentation (you can compare it with the official Relay documentation); I think it would be super inappropriate to ignore these contributors' devotion, especially individual contributors such as @LeshengJin. TVM has a huge user base of researchers; they are an important part of the community, and they also contribute high-quality code instead of just hacking.
   
   Regarding the "lower standard than other communities" issue, TVM has high standards and we are not talking about standards. If no fundamental changes are allowed in DL infrastructures, google should stay at TF 1.0 and never develop JAX, and PyTorch should not create so many different compiler infrastructures (I want to share [this slide](https://chips-compilers-mlsys-22.github.io/assets/slides/PyTorch%20Compilers%20(Compiler%20&%20Chips%20Symposium%202022).pdf) again.
   
   It's 5 am in my timezone, I should have some sleep and I'm still recovering from my recent illness. Opinions on my own and I don't speak for any groups/organizations.
   
   Best,
   Zihao


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] FrozenGene commented on pull request #89: [RFC] Relax Upstreaming

Posted by "FrozenGene (via GitHub)" <gi...@apache.org>.
FrozenGene commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1717067037

   I want to know whether we have a plan to decide when to merge the Unity branch into the main branch. As LLMs are so popular now, we cannot support them well without Unity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950306900


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most use cases, we inevitably also need to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: The rank of a tensor is not known (this can happen in rare cases involving loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handles, cuDNN contexts).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call will not be translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is translated directly into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from this shape-unaware land back to the shape-aware land using match_shape, as sketched below. Computing without shape information is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces to external libraries that carry weaker shape information.
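+
+For intuition, a lowered form of the `unique` example above might look roughly like the following sketch (`some_func` and the exact lowering are assumptions for illustration; only the `torch.unique` dispatch and `match_shape` are described in this section):
+
+```python
+# hypothetical lowered pseudocode for y = R.unique(x) followed by shape recovery
+y = R.call_packed("torch.unique", x)        # allocates and returns a tensor of unknown shape
+y1 = R.match_shape(y, (m,))                 # back in the shape-aware land: defines symbolic m
+z = R.call_tir(some_func, (y1,), (m,), dtype="float32")  # later calls can use m again
+```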
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented as a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and cannot be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could also be converted to operate at the dataflow block level. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. At the same time, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes the shape of a tensor cannot be deduced statically,
+    either because the shape is truly data dependent (such as the
+    output of the `unique` operator) or because of limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, while a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
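+
+For intuition, `relax_func` above corresponds roughly to the following nesting of the AST nodes defined in this section (a pseudo-construction for illustration, not an actual builder API; `...` elides the bound values):
+
+```python
+Function(
+    params=[x, w],
+    body=SeqExpr(
+        blocks=[
+            DataflowBlock(bindings=[VarBinding(lv0, ...), VarBinding(gv0, ...)]),
+            BindingBlock(bindings=[VarBinding(gv1, ...)]),
+        ],
+        body=gv1,
+    ),
+    ret_type=...,
+)
+```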
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap; a sketch of this flow is shown below. Please refer to the shape lowering pass in the next subsection for more details.
+
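+A rough sketch of this flow at the VM level (only `store_shape` is named in this RFC; its full argument list and the other builtin names below are placeholders for illustration):
+
+```python
+# pseudocode of the shape-heap manipulation described above
+store_shape(A.shape, heap, 0, 1)   # heap[0] = m, heap[1] = n
+shape_func(heap)                   # TIR PrimFunc generated by the shape lowering pass:
+                                   #   heap[2] = heap[0] + 1; heap[3] = heap[1] + 1
+jk = construct_shape(heap, 2, 3)   # materialize (j, k) = (m + 1, n + 1) as a Shape value
+```
+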
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   Thanks for the catch. It will be supported, and the [relax repo](https://github.com/tlc-pack/relax/tree/relax/src/relax/backend/contrib) has already demonstrated the functionality of the JSON runtime with TensorRT. It may be good to clarify this in the RFC. cc @YuchenJin 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] junrushao commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
junrushao commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1222788202

   Thank you @leandron @ekalda for the questions, and @zhiics, @slyubomirsky, @Hzfengsy, @sunggg for the discussion!
   
   As a long-term contributor since 2018 (the pre-Relay era), and the initiator and one of the top 2 contributors of RAF ([https://github.com/awslabs/raf/](https://github.com/awslabs/raf/)), the TVM-based training framework, I would love to share my perspective and a slight concern about TVM development at this moment, in 2022.
   
   While TVM is a decent auto-tuner for static-shape workloads, and the latest work on auto-tensorization further boosts its performance with microkernel tuning, there has been strong demand from the community to allow TVM to do more, which, as @YuchenJin listed, includes:
   
   - Unified abstraction
   - Dynamic shape support
   - Dataflow block and first-class side effect handling
   - Non-inference workloads
   
   As a community, we do encourage everyone to understand different perspectives and empower each other, and I believe this is the way for us to grow.
   
   Technically, just wanted to address a meta question here: why is it less feasible to gradually upgrade Relay?
   
   - Conflicting design philosophy: Relax follows a completely different design from Relay, with mutually conflicting assumptions and ideas. For example, having two conflicting shape mechanisms in the system would effectively mean passes have to handle both of them.
   - Engineering challenge: design difference leads to hurdles for incremental updates. For example, if we want to move away from the assumption that the IR is side effect-free, all the passes with the old assumption become automatically invalid or wrong because the assumption is not respected.
   - Stability concern: Even if we make surgical, incremental enhancements to Relay by introducing breaking changes piece by piece, there is still a stability concern. Consider a case where downstream vendors maintain forks that depend on upstream Relay; if Relay's assumptions break over time, it becomes less stable for them to maintain those forks.
   
   Alternatively, we believe having Relax as a separate path is a cleaner and more maintainable approach - gradually bringing some of the passes over from the bottom up is incremental from an engineering standpoint and guarantees that the Relay code path is always stable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950777484


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM's abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
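+
+From the caller's point of view, the output buffer is allocated before the call (a minimal sketch; `alloc_tensor` here stands in for whatever allocator the caller or the compiler uses):
+
+```python
+# caller-side view of a DPS call: output memory is owned by the caller
+out = alloc_tensor((m, n), "float32")   # allocated by the caller / planned by the compiler
+low_level_func(input0, input1, out)     # the callee writes its result into out
+# out now holds the result; no allocation happened inside low_level_func
+```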
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in a tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan for the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node. 
+
+### **call_dps_packed**
+
+Many low-level library functions (e.g. in TensorRT) are designed in DPS. To be able to call into such a DPS packed function, so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented as a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been `R.output`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most of the optimizations are graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level too. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: Each binding block in its `BindingBlock` is evaluated, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example, `relax_func` contains a `SeqExpr`, the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap(which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR Primfunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.
+
+As future plan, we will consolidate Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   I am not as familiar with the Relay VM, but perhaps it might indeed be feasible to combine the VMs (have a single bytecode language) and use it for both compilers. How much additional work would it take compared to the current Relax VM prototype @YuchenJin et al?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] zhiics commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
zhiics commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950782551


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM's abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)

Review Comment:
   maybe `vm.compile` to keep it consistent with the current VM interface?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside the function and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
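+
+To make this calling convention concrete, here is a minimal NumPy-based sketch (illustrative only, not a TVM API) that contrasts a non-DPS function with its DPS counterpart:
+
+```python
+import numpy as np
+
+# Non-DPS style: the callee allocates and returns the output.
+def add_non_dps(a, b):
+    return a + b  # allocation happens inside the callee
+
+# DPS style: the caller allocates the output and passes it in;
+# the callee only writes the result into the provided buffer.
+def add_dps(a, b, out):
+    np.add(a, b, out=out)  # no allocation inside the callee
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocated by the caller (e.g. a memory planner)
+add_dps(a, b, out)
+```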
+
+### call_tir
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (rather than a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) is only syntactic sugar for the above AST node.
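+
+As a concrete illustration of what “registered in the environment via TVM FFI” means, the sketch below registers a packed function from Python and then looks it up by its global name. The function name and body here are placeholders for this example; `tvm.register_func` and `tvm.get_global_func` are the existing TVM APIs being exercised:
+
+```python
+import numpy as np
+import tvm
+
+# Register a packed function under a global name via TVM FFI.
+@tvm.register_func("custom_inplace_update")
+def custom_inplace_update(x):
+    # Placeholder body: overwrite the buffer in place with zeros.
+    x.copyfrom(np.zeros(x.shape, dtype=x.dtype))
+
+# The runtime (or R.call_packed) can later look the function up by name.
+f = tvm.get_global_func("custom_inplace_update")
+arr = tvm.nd.array(np.ones((2, 2), dtype="float32"))
+f(arr)  # arr is now all zeros
+```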
+
+### call_dps_packed
+
+Many low-level library functions (for example in TensorRT) are designed in DPS. To be able to call into such a DPS packed function, and hence let the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for i0, j0, k0 in T.grid(m, k, n):
+            with T.block("matmul"):
+                vi, vj, vk = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[vi, vj] = 0.0
+                C[vi, vj] += A[vi, vk] * B[vk, vj]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.output(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
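+
+As a usage sketch, the module above can be compiled and run with the same VM APIs shown in the user-facing interface section (the concrete shapes below are chosen arbitrarily for illustration):
+
+```python
+import numpy as np
+import tvm
+from tvm import relax
+
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+vm = relax.VirtualMachine(ex, tvm.cpu())
+
+x = tvm.nd.array(np.random.rand(4, 8).astype(np.float32))   # (m, n) = (4, 8)
+y = tvm.nd.array(np.random.rand(8, 16).astype(np.float32))  # (n, k) = (8, 16)
+res = vm["relax_func"](x, y)  # runs tir_matmul, the packed funcs, and custom_tile
+```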
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by `call_tir` and `call_dps_packed` requires the shapes of the output tensors to be computed before the call. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
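+
+The following plain-Python sketch (not a TVM API) illustrates the first idea: compute the output shape with a shape function, allocate the destination, and then invoke the DPS operator:
+
+```python
+import numpy as np
+
+def flatten_shape_func(in_shape):
+    # Shape function: derives the output shape from the input shape.
+    n, c = in_shape
+    return (n * c,)
+
+def flatten_dps(x, out):
+    # DPS operator: writes the flattened result into the pre-allocated output.
+    out[:] = x.reshape(-1)
+
+x = np.random.rand(3, 4).astype("float32")
+out_shape = flatten_shape_func(x.shape)      # shape computed before the call
+out = np.empty(out_shape, dtype="float32")   # destination allocated by the caller
+flatten_dps(x, out)
+```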
+
+In Relax, a shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.output(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where a type depends on a value (the shape in our case), which requires heavier machinery to handle.
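+
+The toy model below (deliberately not the TVM implementation) illustrates this separation: two expressions with different `shape_` fields can still share the same `DynTensorType`, so type checking only needs to compare rank and dtype:
+
+```python
+from dataclasses import dataclass
+from typing import Optional, Tuple
+
+@dataclass(frozen=True)
+class ToyDynTensorType:
+    ndim: int     # -1 means unknown rank
+    dtype: str    # "" means unknown dtype
+
+@dataclass
+class ToyTensorExpr:
+    checked_type: ToyDynTensorType
+    shape: Optional[Tuple]  # symbolic/static shape, or None for RuntimeDepShape
+
+a = ToyTensorExpr(ToyDynTensorType(2, "float32"), ("n", 4))
+b = ToyTensorExpr(ToyDynTensorType(2, "float32"), ("n", "m"))
+
+# The shapes differ, but the types are identical: rank 2, float32.
+assert a.checked_type == b.checked_type
+```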
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
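+
+To make the two overloaded semantics above concrete, here is a minimal Python sketch of what `match_shape` does (illustrative semantics only, not the TVM implementation):
+
+```python
+import numpy as np
+
+def match_shape(value, pattern, bindings):
+    """Match a concrete shape (tuple) or a tensor's shape against `pattern`,
+    a list of ints and symbolic variable names, populating `bindings` for
+    variables seen for the first time in the current scope."""
+    shape = value if isinstance(value, tuple) else value.shape
+    assert len(shape) == len(pattern), "rank mismatch"
+    for dim, pat in zip(shape, pattern):
+        if isinstance(pat, int):
+            assert dim == pat, "static dimension mismatch"
+        elif pat in bindings:
+            assert dim == bindings[pat], "symbolic dimension mismatch"
+        else:
+            bindings[pat] = dim  # first occurrence defines the symbolic variable
+    return value  # the value, now with a known/refined shape
+
+scope = {}
+t = match_shape(np.zeros((5,)), ["m"], scope)  # Tensor overload: binds m = 5
+s = match_shape((5,), ["m"], scope)            # Shape overload: checks m == 5
+```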
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects used by low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call will not be translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return a TVM Object. We can bring tensors from the no-shape-computation land to the shape-aware land using `match_shape`. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
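+
+For example, a safety-net lowering of `R.unique` could dispatch to a packed function registered from Python. The sketch below assumes such a function (named "my_unique" here purely for illustration) is registered by the user or the compiler; the result tensor is allocated inside the callee because its shape is only known after the computation:
+
+```python
+import numpy as np
+import tvm
+
+@tvm.register_func("my_unique")
+def my_unique(x):
+    # Allocate and return the result; no destination can be pre-allocated
+    # because the output shape is data dependent.
+    return tvm.nd.array(np.unique(x.numpy()))
+
+f = tvm.get_global_func("my_unique")
+y = f(tvm.nd.array(np.array([1, 2, 2, 3], dtype="int32")))
+print(y.shape)  # (3,) -- only discovered at runtime
+```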
+
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` field is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, and the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.output(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
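+
+As a sketch of how this nesting can be traversed, the snippet below uses toy stand-ins for the classes defined above (they are not the actual TVM objects) and visits only the pure bindings, i.e. those inside a `DataflowBlock`:
+
+```python
+from dataclasses import dataclass, field
+from typing import List
+
+# Toy stand-ins mirroring the Section 4.1 class names.
+@dataclass
+class VarBinding:
+    var: str
+    value: str
+
+@dataclass
+class BindingBlock:
+    bindings: List[VarBinding] = field(default_factory=list)
+
+class DataflowBlock(BindingBlock):
+    pass
+
+@dataclass
+class SeqExpr:
+    blocks: List[BindingBlock]
+    body: str
+
+# relax_func's body: one DataflowBlock followed by one BindingBlock.
+body = SeqExpr(
+    blocks=[
+        DataflowBlock([VarBinding("lv0", "R.dot(x, w)"),
+                       VarBinding("gv0", "R.flatten(lv0)")]),
+        BindingBlock([VarBinding("gv1", 'R.call_packed("custom_inplace_update", gv0)')]),
+    ],
+    body="gv1",
+)
+
+# A graph-level pass that only rewrites pure regions would visit DataflowBlocks only.
+for block in body.blocks:
+    if isinstance(block, DataflowBlock):
+        for b in block.bindings:
+            print("pure binding:", b.var, "=", b.value)
+```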
+
+## 4.2 Relax runtime
+
+For ease of implementation and for the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** instruction (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions, invoked with the **Call** instruction to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. A plain-Python sketch of this idea is shown below; please refer to the shape lowering pass in the next subsection for more details.
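+
+Below is a plain-Python sketch of the shape-heap idea (illustrative only; in the actual design the heap is manipulated through VM builtin packed functions and a TIR shape function emitted by the shape lowering pass):
+
+```python
+import numpy as np
+
+# The shape heap is a flat integer NDArray owned by the VM.
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(shape, indices):
+    # Conceptually the VM builtin `store_shape`: write A.shape into heap slots.
+    for idx, dim in zip(indices, shape):
+        shape_heap[idx] = dim
+
+def shape_func():
+    # Conceptually the TIR PrimFunc generated by the shape lowering pass,
+    # computing (j, k) = (m + 1, n + 1) from heap slots 0 and 1.
+    shape_heap[2] = shape_heap[0] + 1
+    shape_heap[3] = shape_heap[1] + 1
+
+A = np.zeros((8, 16), dtype="float32")
+store_shape(A.shape, [0, 1])  # m, n -> slots 0, 1
+shape_func()                  # j, k -> slots 2, 3
+j, k = int(shape_heap[2]), int(shape_heap[3])  # (9, 17)
+```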
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow users to explore new design spaces.
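+
+For example, a customized pipeline could be assembled with the existing pass infrastructure and handed to the minimum build. The pass names below are placeholders (the concrete pass list is an implementation detail); `tvm.transform.Sequential` and `relax.vm.build` are the pieces this sketch relies on:
+
+```python
+import tvm
+from tvm import relax
+
+seq = tvm.transform.Sequential(
+    [
+        relax.transform.MyCustomRewrite(),  # placeholder: user-defined IRModule -> IRModule pass
+        relax.transform.LowerToTIRCalls(),  # placeholder: lowering pass
+    ]
+)
+mod = seq(MyIRModule)                                # composable IRModule -> IRModule transformations
+ex = relax.vm.build(mod, tvm.target.Target("llvm"))  # minimum build
+```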
+
+Relax compilation flow is designed with the following goals:
+
+- Compile the Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:

Review Comment:
   Will we reuse the pass infra? I think we can probably add a block-level transformation infra to allow users to perform some interesting transformations/analysis at the BindingBlock level.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1

Review Comment:
   Is this just an example? Or do we expect users to assign shapes explicitly with no type inference?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,

Review Comment:
   Are we assuming there is only output or this is just for demonstration purpose? And shape and dtype will be separate in the future, right?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for example the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return a TVM Object. We can bring tensors from the shape-unaware world back into the shape-aware world using match_shape. Computing without shape information is by no means the most efficient way to handle things, but it is necessary for cases like data-dependent calculations and interfacing with external libraries that carry weaker shape information.
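+
+For instance, the round trip from the shape-unaware world back into the shape-aware world could look like the following sketch (the `"torch.unique"` dispatch target is the illustrative example from above, and the symbolic variable `k` is introduced only for this sketch):
+
+```python
+# y's shape is unknown at compile time (RuntimeDepShape); the lowered call
+# allocates and returns the result tensor directly instead of using DPS.
+y = R.call_packed("torch.unique", x)
+
+# re-enter the shape-aware world: bind the symbolic variable k to y's runtime shape
+y1: R.Tensor[(k,), "float32"] = R.match_shape(y, (k,))
+```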
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
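+
+To make this concrete, the following is a schematic sketch of a dataflow-block-level rewrite, written against the AST classes sketched in Section 4.1; the constructor signatures used here are assumptions for illustration, not the exact Python API:
+
+```python
+def rewrite_dataflow_blocks(func: Function, rewrite_binding) -> Function:
+    """Apply a binding-level rewrite only inside DataflowBlocks (illustrative)."""
+    new_blocks = []
+    for block in func.body.blocks:  # func.body is assumed to be a SeqExpr
+        if isinstance(block, DataflowBlock):
+            # safe to rewrite here: bindings are pure and contain no control flow
+            new_blocks.append(DataflowBlock([rewrite_binding(b) for b in block.bindings]))
+        else:
+            # possibly effectful region: left untouched by this pass
+            new_blocks.append(block)
+    return Function(func.params, SeqExpr(new_blocks, func.body.body), func.ret_type)
+```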
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as a normal `Var`.
+- A `SeqExpr` is evaluated as follows: each of its `BindingBlock`s is evaluated in order, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
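+
+For reference, a rough, hand-written decomposition of `relax_func` into the constructs above looks as follows; it is shown as schematic pseudo-constructors rather than the exact Python API:
+
+```python
+# Function(params=[x, w], ret_type=..., body=SeqExpr(
+#     blocks=[
+#         DataflowBlock(bindings=[
+#             VarBinding(var=DataflowVar("lv0"), value=Call(R.dot, [x, w])),
+#             VarBinding(var=Var("gv0"),         value=Call(R.flatten, [lv0])),
+#         ]),
+#         BindingBlock(bindings=[
+#             VarBinding(var=Var("gv1"), value=Call(ExternFunc("custom_inplace_update"), [gv0])),
+#         ]),
+#     ],
+#     body=gv1))
+```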
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call instruction** (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions invoked with the **Call** instruction to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap (see the sketch after this list). Please refer to the shape lowering pass in the next subsection for more details.
+
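+For illustration, the shape computation described above could be lowered to something like the sketch below; the builtin name `store_shape` and the generated `shape_func` are assumptions for this example, not the exact functions emitted by the MVP:
+
+```python
+@T.prim_func
+def shape_func(heap: T.handle) -> None:
+    # illustrative PrimFunc generated by the shape lowering pass
+    H = T.match_buffer(heap, (4,), "int64")
+    # H[0] = m and H[1] = n were populated earlier via the VM builtin store_shape(A.shape)
+    H[2] = H[0] + 1  # j = m + 1
+    H[3] = H[1] + 1  # k = n + 1
+```
+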
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations (see the sketch below):
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
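+For illustration, a composable pipeline under this design could look like the following sketch; the specific pass names are assumptions used for the example rather than a finalized API:
+
+```python
+# Illustrative only: each pass is an IRModule -> IRModule transformation.
+seq = tvm.transform.Sequential([
+    relax.transform.ToNonDataflow(),   # flatten dataflow blocks (assumed name)
+    relax.transform.CallTIRRewrite(),  # lower call_tir into explicit allocation (assumed name)
+    relax.transform.VMShapeLower(),    # lower symbolic shape computation onto the shape heap (assumed name)
+])
+lowered_mod = seq(MyIRModule)          # MyIRModule/target as in the earlier user-facing example
+ex = relax.vm.build(lowered_mod, target)
+```
+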
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("int32"), T.var("int32")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.grid(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it now lives in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)

Review Comment:
   Does `builtin` indicate it is a bytecode instruction?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on the Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
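+
+As a point of reference outside of TVM, a destination-passing-style function in plain NumPy could look like the sketch below; `dps_add` is an illustrative name and is not part of this RFC or TVM:
+
+```python
+import numpy as np
+
+def dps_add(a: np.ndarray, b: np.ndarray, out: np.ndarray) -> None:
+    # the caller allocates `out`; the callee only writes the result into it
+    np.add(a, b, out=out)
+
+a = np.ones(4, dtype="float32")
+b = np.ones(4, dtype="float32")
+out = np.empty(4, dtype="float32")  # allocated by the caller (e.g. the compiler's memory planner)
+dps_add(a, b, out)
+```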
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way) so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
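+Putting the two fields together, the text-format annotations used throughout this RFC correspond roughly to the following (illustrative comments, not a literal API):
+
+```python
+# R.Tensor[(n, 4), "float32"] -> checked_type_: DynTensorType(ndim=2, dtype="float32")
+#                                shape_:        ShapeExpr([n, 4])                        # V0
+# R.Tensor[_, "float32"]      -> checked_type_: DynTensorType(ndim=-1, dtype="float32")  # rank not annotated
+#                                shape_:        RuntimeDepShape()                        # V2
+```
+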
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where the output shape is data-dependent on the input (e.g. the `unique` operator).
+- C1: The rank of a tensor is not known (this can happen in rare cases involving loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for example the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return a TVM Object. We can bring tensors from the shape-unaware world back into the shape-aware world using match_shape. Computing without shape information is by no means the most efficient way to handle things, but it is necessary for cases like data-dependent calculations and interfacing with external libraries that carry weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as a normal `Var`.
+- A `SeqExpr` is evaluated as follows: each of its `BindingBlock`s is evaluated in order, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For the ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similiar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap(which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR Primfunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.

Review Comment:
   Will this imply that we will perform some runtime shape inference here to get the concrete value of `m` and `n`?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on the Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way) so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
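+
+As a small illustration of this separation, the two fields of `lv0` above can be thought of as follows (a minimal sketch; the direct attribute access is for illustration only):
+
+```python
+# lv0: R.Tensor[(n, 4), "float32"]
+lv0.checked_type_  # DynTensorType(ndim=2, dtype="float32"): only rank and dtype live in the type
+lv0.shape_         # ShapeExpr([n, 4]): the shape is carried as a separate value field
+```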
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
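+
+The two overloaded semantics can be summarized by the following pseudo-code sketch, in the same spirit as the `call_tir` semantics demo above. The helpers `is_tensor`, `bind_symbolic_vars`, `runtime_shape_of`, and `with_shape` are illustrative placeholders, not part of this RFC.
+
+```python
+def match_shape(value, pattern):
+    """Example code to demonstrate the semantics of match_shape (sketch only)."""
+    if is_tensor(value):
+        # bind symbolic variables in `pattern` that occur for the first time in this scope
+        bind_symbolic_vars(pattern, runtime_shape_of(value))
+        # return the same tensor, but with its shape_ field refined to `pattern`
+        return with_shape(value, pattern)
+    else:
+        # `value` is a Shape: match it against `pattern` and return the matched Shape
+        bind_symbolic_vars(pattern, value)
+        return pattern
+```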
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that may not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects used by low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back to the shape-aware land using match_shape. Bypassing shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that provide weaker shape information.
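+
+For instance, a possible lowering of the `R.unique` call above could look like the following sketch (the choice of `torch.unique` as the backing packed function and the exact lowered form are illustrative assumptions):
+
+```python
+# before lowering: the result shape is runtime dependent
+# y: R.Tensor[_, "float32"] = R.unique(x)
+
+# after lowering (sketch): call an opaque packed function that allocates and
+# returns its own result, then re-enter the shape-aware world via match_shape
+y = R.call_packed("torch.unique", x)
+y1: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+```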
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, without changing any other part of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output with `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
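+
+As a minimal sketch of why this helps pass writers, a block-level rewrite only needs to touch `DataflowBlock`s and can leave everything else untouched. The constructor calls below follow the field order of the AST classes listed in Section 4.1, and `rewrite_binding` stands in for an arbitrary pure graph rewrite; both are assumptions for illustration, not a fixed API.
+
+```python
+def rewrite_dataflow_blocks(func: Function, rewrite_binding) -> Function:
+    """Sketch: apply a pure graph rewrite only inside DataflowBlocks."""
+    seq = func.body  # assume the function body is a SeqExpr
+    new_blocks = []
+    for block in seq.blocks:
+        if isinstance(block, DataflowBlock):
+            # safe to rewrite: bindings here are pure and contain no control flow
+            new_blocks.append(DataflowBlock([rewrite_binding(b) for b in block.bindings]))
+        else:
+            # impure bindings (side effects, control flow) are left untouched
+            new_blocks.append(block)
+    return Function(func.params, SeqExpr(new_blocks, seq.body), func.ret_type)
+```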
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. At the same time, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as a normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call instruction** (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and the runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions invoked with the **Call instruction** to get the shape of a tensor).
+- Do shape calculations via manipulation of a shape heap (an internal NDArray).
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of the shape heap (a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap, as sketched below. Please refer to the shape lowering pass in the next subsection for more details.
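+
+A rough sketch of this shape-heap mechanism is given below. The generated PrimFunc and the way the heap is passed around are illustrative assumptions; the actual code produced by the shape lowering pass may differ.
+
+```python
+# At runtime (sketch): the VM builtin `store_shape(A.shape)` writes A's shape
+# into the shape heap, so that heap[0] = m and heap[1] = n.
+# The shape lowering pass then generates a TIR PrimFunc like the following,
+# which computes (j, k) = (m + 1, n + 1) into heap[2] and heap[3]:
+
+@T.prim_func
+def shape_func(h: T.handle) -> None:
+    H = T.match_buffer(h, (4,), "int64")
+    H[2] = H[0] + 1  # j = m + 1
+    H[3] = H[1] + 1  # k = n + 1
+```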
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- Has no shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
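+
+For the running example above, the lowered explicit-memory form might look roughly like the following sketch. The byte sizes, the `cpu` device handle, and the exact calling convention of the lowered calls are illustrative assumptions only; in practice the sizes come from the shape computations described in Section 4.2.
+
+```python
+@R.function
+def relax_function_lowered(x):
+    # sketch: each call_tir becomes an explicit allocation plus a direct call
+    storage0 = relax.runtime.builtin.alloc_storage(size=n * m * 4, device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=(n, m), offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=m * n * 4, device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=(m * n,), offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    relax.runtime.builtin.free_tensor(tensor0)
+    relax.runtime.builtin.free_storage(storage0)
+    return tensor1
+```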

Review Comment:
   Do we expect users to free storage and tensors by themselves? Or will we introduce some liveness analysis to decide when storage and/or tensors should be freed?
   
   In addition, we may need to reuse the current device placement pass when executing on GPU, since putting some data (e.g. scalar inputs) on GPU may not make sense.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed before the call. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain many (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that may not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects used by low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back to the shape-aware land using match_shape. Bypassing shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that provide weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, without changing any other part of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output with `R.output()`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most of the optimizations are graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level too. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: Each binding block in its `BindingBlock` is evaluated, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap(which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR Primfunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.
+
+As future plan, we will consolidate Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   I feel we could probably leverage some of the existing VM infra, i.e. the way we manage the frames, the bytecode de/serialization mechanism, and likely the memory pool, etc. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950885236


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) is only syntactic sugar for the above AST node.
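+
+For reference, the sketch below (with a hypothetical function name) shows what "registered in the environment via TVM FFI" means: a Python function registered under a global name becomes a PackedFunc that can later be looked up and invoked by that name:
+
+```python
+import numpy as np
+import tvm
+
+# Register a non-DPS packed function under a global name (name is illustrative)
+@tvm.register_func("my_packed_func")
+def my_packed_func(a):
+    # `a` is a tvm.nd.NDArray; a new NDArray is returned (non-DPS)
+    return tvm.nd.array(a.numpy() + 1.0)
+
+# Look up the registered PackedFunc and call it
+f = tvm.get_global_func("my_packed_func")
+res = f(tvm.nd.array(np.zeros((2, 2), dtype="float32")))
+print(res)
+```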
+
+### **call_dps_packed**
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly manage the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain many (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
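+
+As a toy illustration of the first approach (computing the output shape before invoking a DPS kernel), consider the following NumPy sketch, where all names are hypothetical:
+
+```python
+import numpy as np
+
+def flatten_shape_func(in_shape):
+    # shape function: derives the output shape from the input shape
+    return (in_shape[0] * in_shape[1],)
+
+def flatten_dps(x, out):
+    # DPS kernel: writes the flattened values into the pre-allocated `out`
+    out[:] = x.reshape(-1)
+
+x = np.random.rand(3, 4).astype("float32")
+out = np.empty(flatten_shape_func(x.shape), dtype="float32")  # shape computed first
+flatten_dps(x, out)
+```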
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that may not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't be translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land back to the shape-aware land using `match_shape`. Running without shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that have weaker shape information.
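+
+As a rough sketch of how such a data-dependent operation can be exposed to the runtime, the example below registers a non-DPS packed function (the registered name is hypothetical) that allocates and returns its own result, since the output size is only known after the computation:
+
+```python
+import numpy as np
+import tvm
+
+@tvm.register_func("my_unique")
+def my_unique(x):
+    # the output size depends on the data, so the callee allocates the result itself
+    return tvm.nd.array(np.unique(x.numpy()))
+
+f = tvm.get_global_func("my_unique")
+y = f(tvm.nd.array(np.array([1, 2, 2, 3, 3, 3], dtype="float32")))
+print(y)  # [1. 2. 3.]
+```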
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output; it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow blocks. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
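+
+The toy sketch below (made-up data structures, not the actual Relax pass infrastructure) illustrates why pure, straight-line bindings are easy to rewrite: a "block" is just an ordered list of pure bindings, so a pass can substitute and drop bindings locally without worrying about side effects:
+
+```python
+from typing import List, Tuple
+
+Binding = Tuple[str, str, list]   # (var, op, args); args are variable names or constants
+
+def fold_add_zero(block: List[Binding]) -> List[Binding]:
+    """Rewrite add(x, 0) -> x; safe only because every binding is pure."""
+    subst, out = {}, []
+    for var, op, args in block:
+        args = [subst.get(a, a) if isinstance(a, str) else a for a in args]
+        if op == "add" and args[1] == 0:
+            subst[var] = args[0]          # later uses of `var` become args[0]
+        else:
+            out.append((var, op, args))
+    return out
+
+block = [("lv0", "add", ["x", 0]), ("gv0", "exp", ["lv0"])]
+print(fold_add_zero(block))               # [('gv0', 'exp', ['x'])]
+```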
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr` (a toy sketch of this rule follows below).
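+
+The following self-contained Python toy (simplified, made-up classes rather than the actual Relax data structures) sketches this evaluation rule:
+
+```python
+from dataclasses import dataclass
+from typing import Any, Callable, Dict, List
+
+@dataclass
+class Binding:
+    var: str
+    value: Callable[[Dict[str, Any]], Any]   # a thunk evaluated against the environment
+
+@dataclass
+class BindingBlock:
+    bindings: List[Binding]
+
+@dataclass
+class SeqExpr:
+    blocks: List[BindingBlock]
+    body: Callable[[Dict[str, Any]], Any]
+
+def eval_seq_expr(seq: SeqExpr) -> Any:
+    env: Dict[str, Any] = {}
+    for block in seq.blocks:                  # evaluate each block in order
+        for b in block.bindings:
+            env[b.var] = b.value(env)         # bind each var to its evaluated value
+    return seq.body(env)                      # the body's value is the SeqExpr's value
+
+# e.g. lv0 = 2 + 3; gv0 = lv0 * 10; result = gv0
+seq = SeqExpr(
+    blocks=[BindingBlock([Binding("lv0", lambda e: 2 + 3),
+                          Binding("gv0", lambda e: e["lv0"] * 10)])],
+    body=lambda e: e["gv0"],
+)
+print(eval_seq_expr(seq))  # 50
+```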
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility in supporting dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** instruction (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details; a toy sketch of the shape-heap idea follows below.
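+
+The toy sketch below (plain NumPy, with made-up function names rather than the actual VM builtins) illustrates the shape-heap idea: symbolic shape values live in a small integer array, and shape arithmetic is just reads and writes on that array:
+
+```python
+import numpy as np
+
+shape_heap = np.zeros(4, dtype="int64")    # slots: [m, n, j, k]
+
+def store_shape(shape):
+    # analogous to the VM builtin that stores a tensor's runtime shape on the heap
+    shape_heap[0], shape_heap[1] = shape
+
+def shape_func():
+    # stands in for the TIR shape function generated by the shape lowering pass
+    shape_heap[2] = shape_heap[0] + 1      # j = m + 1
+    shape_heap[3] = shape_heap[1] + 1      # k = n + 1
+
+store_shape((5, 7))
+shape_func()
+print(shape_heap)                           # [5 7 6 8]
+```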
+
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
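+
+Schematically (using a dictionary as a toy stand-in for an IRModule, not the real pass infrastructure), this composability means a pipeline is just function composition of IRModule → IRModule passes:
+
+```python
+from typing import Callable, List
+
+IRModule = dict                                     # toy stand-in for tvm.IRModule
+Pass = Callable[[IRModule], IRModule]
+
+def compose(passes: List[Pass]) -> Pass:
+    def pipeline(mod: IRModule) -> IRModule:
+        for p in passes:
+            mod = p(mod)                            # every step stays IRModule -> IRModule
+        return mod
+    return pipeline
+
+def rename_main(mod: IRModule) -> IRModule:
+    # a trivial example pass
+    return {("relax_main" if k == "main" else k): v for k, v in mod.items()}
+
+print(compose([rename_main])({"main": "..."}))      # {'relax_main': '...'}
+```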
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts of the program.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effects
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)

Review Comment:
   `builtin` indicates a bunch of packed functions that the VM can invoke via the `Call` instruction. This includes functions such as `shape_of` for obtaining runtime shapes, and various memory management functions.





[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950884708


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:

Review Comment:
   Exactly! We will reuse the existing pass infra, and introduce a DataflowBlock-level pass that only permits dataflow block level transformations. It can cover most cases of graph rewriting.





[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950303415


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
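+
+As a concrete, library-agnostic illustration of the convention (a NumPy sketch, not tied to TVM itself), the caller owns the output allocation and the callee writes into it:
+
+```python
+import numpy as np
+
+# DPS-style function: the caller allocates `out`; nothing is returned.
+def dps_add(a, b, out):
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocated by the caller (e.g. the compiler)
+dps_add(a, b, out)
+```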
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) and let the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to be able to cover general cases that may not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring such tensors from the no-shape-computation land back into the shape-aware land using `match_shape`. Computing without shapes is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that provide weaker shape information.
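+
+A minimal sketch of how this composes with `match_shape` (reusing the constructs introduced above; the exact lowering is illustrative):
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def safety_net_example(x: R.Tensor[(n,), "float32"]):
+    # shape of lv0 is unknown at compile time (RuntimeDepShape)
+    lv0: R.Tensor[_, "float32"] = R.unique(x)
+    # recover a symbolic shape (m,) so later computation can be shape-aware
+    lv1: R.Tensor[(m,), "float32"] = R.match_shape(lv0, (m,))
+    gv0: R.Tensor[(m,), "float32"] = R.exp(lv1)
+    return gv0
+```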
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output; it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output with `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow blocks. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow-block level. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` bound in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope, such as an *if* branch). Note that TIR variables (bound by `MatchShape`) follow the same scoping rules as a normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:

Review Comment:
   This specific runtime is only the Relax counterpart of the Relay VM. We understand your concern; hopefully the following RFC can address and clarify this.





[GitHub] [tvm-rfcs] masahi commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
masahi commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952361697


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) and let the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to be able to cover general cases that may not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.

Review Comment:
   Does this mean all data-dependent dynamic ops need to have runtime packed functions, for all targets? 
   
   Even in Relay / TE we can implement `unique` / `NMS` perfectly fine and we don't need special runtime support. We do that by essentially treating `unique` and `NMS` etc as a static op and pushing all dynamism handling to dynamic `strided_slice`.  





[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r957864345


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        n = T.var("int32")
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the AST node above.
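+
+For reference, the PackedFunc being called is simply a function registered through the TVM FFI. A minimal sketch of such a registration and a direct FFI call (the name `my_packed_func` is only a placeholder):
+
+```python
+import numpy as np
+import tvm
+
+# Register a global PackedFunc in the environment; this is what
+# Call(op=ExternFunc("my_packed_func"), *args) resolves to at runtime.
+@tvm.register_func("my_packed_func")
+def my_packed_func(a):
+    # `a` arrives as a tvm.nd.NDArray; this PackedFunc is non-DPS and returns a value.
+    return tvm.nd.array(a.numpy() * 2)
+
+f = tvm.get_global_func("my_packed_func")
+print(f(tvm.nd.array(np.arange(4, dtype="float32"))))
+```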
+
+### ****call_dps_packed****
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) while still allowing the compiler to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
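+
+The "compute the shape first, then call the DPS function" pattern can be sketched in plain Python/NumPy; this illustrates only the calling pattern and is not Relax code:
+
+```python
+import numpy as np
+
+def flatten_shape_func(in_shape):
+    # Shape function: derives the output shape from the input shape.
+    n, m = in_shape
+    return (n * m,)
+
+def flatten_dps(x, out):
+    # DPS kernel: writes into the pre-allocated output buffer.
+    out[:] = x.reshape(-1)
+
+x = np.random.rand(3, 4).astype("float32")
+out_shape = flatten_shape_func(x.shape)     # shape computation happens first
+out = np.empty(out_shape, dtype="float32")  # allocation uses the computed shape
+flatten_dps(x, out)
+```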
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
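+
+The runtime matching rule behind the first overload can be sketched in ordinary Python (a hypothetical helper working on plain tuples and a dict of symbolic bindings; this is not the actual Relax implementation):
+
+```python
+def match_shape(runtime_shape, pattern, bindings):
+    """Match a concrete shape against a pattern of ints and symbolic names."""
+    assert len(runtime_shape) == len(pattern), "rank mismatch"
+    for dim, sym in zip(runtime_shape, pattern):
+        if isinstance(sym, int):          # static dimension: must agree exactly
+            assert dim == sym
+        elif sym in bindings:             # symbolic var seen before: must agree
+            assert bindings[sym] == dim
+        else:                             # first occurrence: populate the var
+            bindings[sym] = dim
+    return runtime_shape
+
+scope = {}
+match_shape((3, 4), ("m", "n"), scope)    # first occurrence: binds m=3, n=4
+match_shape((4, 3), ("n", "m"), scope)    # later occurrence: must be consistent
+print(scope)                              # {'m': 3, 'n': 4}
+```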
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that may not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. a PRNG handle or cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is translated directly into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.

Review Comment:
   This is a great question! This does not mean data-dependent ops are required to be implemented using PackedFunc.
   
   In Relax, we also support implementing dynamic ops using TE/TIR. For example, we can implement `unique` by splitting the operation into multiple phases (first outputting a tensor of the same size as the input data) like you said, and this is supported by EmitTE.
   
   One way to quickly support operators in Relax is to fall back to third-party framework libraries, for example by calling `torch.unique` as mentioned in this paragraph. With the first-class support of calling PackedFunc in the graph IR (via `call_dps_packed` and `call_packed`), we can generate direct calls to these third-party libraries to get immediate op coverage.
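   
   To make that fallback concrete, here is a minimal sketch of registering such a wrapper through the TVM FFI (assuming PyTorch is installed; the registration name `torch.unique` and the exact lowering are illustrative only, not the actual Relax lowering):
   
   ```python
   import numpy as np
   import torch
   import tvm
   
   # Wrap torch.unique as a PackedFunc; a lowered call_packed("torch.unique", x)
   # would invoke this at runtime.
   @tvm.register_func("torch.unique")
   def torch_unique(x):
       result = torch.unique(torch.from_numpy(x.numpy()))
       return tvm.nd.array(result.numpy())
   
   f = tvm.get_global_func("torch.unique")
   y = f(tvm.nd.array(np.array([1, 2, 2, 3, 1], dtype="int64")))
   print(y)  # the shape of y is only known after the call returns
   ```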



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r953154506


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+

Review Comment:
   Thank you for sharing your thoughts, @Mousius. It is interesting to learn that these problems have been tackled in the BYOC world. Like @comaniac, @junrushao and others, I want to clarify that I don't disagree that these functionalities can be achieved by extending the current Relay framework. However, in my PoV, it is a matter of the amount of engineering effort and of future extensibility. Making a change of this scale, which touches the framework-level design philosophy, in the main pipeline while guaranteeing existing behavior and performance sounds extremely difficult, IMHO. Although I joined the Relax project in the middle, I believe these are part of the motivation to build a new IR around those new design considerations and goals. Since the Relax pipeline does not disturb the Relay pipeline, I think both pipelines can evolve together. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1224114184

   Thank you, everyone, for the discussions here. Let us take a step back and look at the non-technical parts of the conversation. A lot of our discussions come from two goals:
   
   G0: Maintaining a stable evolution solution for some of our common use-cases
   G1: Welcome new improvements, land our technical commitment timely, continue to reinvent ourselves, and welcome new community members who have new use cases.
   
   Both goals are very important. G0 ties to our ability to continuously support our current use cases. G1 is also essential to our viability as a solution, so we can grow as a community and stay competitive in a fast-evolving machine learning compilation landscape.
   
   Enabling both has always been an important theme of long-living projects. Deep learning frameworks are a common reference to refer back to. Usually, they are done in roughly three phases:
   S0: Introduction of a new feature/component as an optional module.
   S1: Evolving the overall solutions to make use of the new component.
   S2: Consider deprecation of some of the existing solutions, or evolve the solutions for a consolidation point.
   
   Each stage contains a different level of commitment and would normally entail different levels of gating criteria as we look at them.
   
   For example, PyTorch introduced TorchFX as an optional module that supports graph tracing and export. It had some overlapping capabilities with TorchScript. The PyTorch community is collectively evolving some of the compilations (TorchDynamo) to make use of FX. As of now, there is not yet an announcement of S2 from the community.
   
   Encouraging S0 and making it easy to do helps us enable G1. Too high a barrier here can discourage community contributions and result in mainline lacking the latest features, undercutting our competitiveness. This is especially important given that the landscape of machine learning compilation still remains open, and the ability to support symbolic shapes and training in a timely manner helps bring in users and contributors who would otherwise turn to alternatives.
   
   G0 is equally important here. In many cases, it boils down to making careful and informed decisions regarding evolution (S1 and S2), and to making sure that at the S0 stage there is limited disruptive change to the existing infrastructure. Importantly, not every module/feature has to go through all stages. And in common practice, the decisions in each stage are usually not made at the same time.
   
   We can find examples of S0 cases in TVM as well. For example, USMP was initially designed for specific cases like AOT. We welcomed these improvements to unblock needs in embedded settings early. Through USMP we found the need for tir.alloc_const, which relates to evolving existing infra (S1). As a result, we had a more in-depth discussion. Additionally, we are bringing the effort to further enable USMP in a broader setting as part of S1. At some point, we might consider consolidating all memory allocations as S2 – note that many community members are collectively working toward that goal, but we are not yet at a point to make such a decision. As another example, we enabled cascaders that are specifically designed for micro-NPUs, which have some domain overlap with the arithmetic affine module, but were nevertheless brought in without consolidation because we believed that there is enough interest and maintenance support for the module. Finally, the unpacked_api was specifically enabled for extremely low-resource settings, and we enabled S0 level inclusion despite some inconsistency with the packed func API.
   
   Of course, we do not want to enable random things in the codebase, which ties back to the maintenance overhead concern. One of the questions we want to ask here is whether the module has enough support from the community to allow continued maintenance. Additionally, we should consider the added engineering support we gain by welcoming additional community members who are interested in these needs and would otherwise look elsewhere.
   
   Our overall thought process and decision time point for each stage can be different – they should be so we can enable both G0 and G1. Nor do all modules have to go through all the stages. 
   
   For S0, we would expect that there are enough champions in the community with a self-contained plan. For important features, we would expect, say, more than three committers who can champion the module and significant community support to maintain it. Additionally, S0 should be made as minimally disruptive (with respect to the current infrastructure) as possible. To encourage G1, we can overlook some level of duplication (just like the TorchFX and TorchScript case, USMP, and other allocators when they land as S0), considering the additional community support we get to maintain them. 
   
   S1 and S2 would involve more careful discussions and coordination with greater amounts of details on some of the key points. Likely, they will also happen at a different time point so we can make informed decisions.
   
   This particular RFC is at the S0 stage and intentionally made to be so. As the RFC stated, there is no proposal to make S1/S2 decisions at this RFC. Many of our current discussions are around S1/S2 – the future evolution of the system. They are extremely helpful discussions to have to set up the context and help us improve the design, but not necessarily decisions we have to make immediately. Let us think about the broader community members we can empower and bring in through enabling the S0 improvement.
   
   Thank you, everyone, for the discussions so far, and let us work together to enable our community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1404170556

   Five years ago, we started with a community that comes with a common vision in mind – enabling machine learning engineers to optimize and run computations efficiently on any hardware backend. 
   
   Five years later, the fields of machine learning and MLC (ML compilation) have undergone rapid change. That same vision is still shared among this community. This is why many of us still feel so fresh when writing code patches, bug fixes, architectural refactors, and new features. We are here today thanks to a diverse community that comes with different perspectives and areas of focus but still aligns around that common vision. 
   
   As a project, we benefit from different perspectives to survive in the ever-changing and competitive area of ML compilation. Hardly can one predict every detail of the future (just observe the set of recent changes such as ChatGPT and stable diffusion). Hardly can one definitively assert that one approach will be better than another from the very beginning. Enabling diverse possibilities helps to move the project forward while serving different needs.
   
   As a community, while we care about different subsets of modules and do not always need to work on the same thing, there is always an overlap of interests, regardless of whether it is the graph, FFI, TensorIR, or backend that sustains collaborations among different people. Most importantly, we come with a mindset of empowering each other under the same vision.
   
   Thank you, everyone, for participating in this thread. 
   
   This thread arrives at its current state due to different perspectives on project procedural operations (whether a detailed migration plan and a commitment to migration are necessary for a new module proposal). There is common agreement that migration (if it happens and is proposed) would require a lot of detail and community buy-in, but there are different opinions about when and how that should happen. 
   
   On behalf of the TVM PMC, I would like to recommend an initial step to help us recognize and achieve the following goals from different members of the community:
   - G0: Get us out of stagnation and empower the community, including many who shared their support in this thread, to participate in unity development in the TVM community.
   - G1: Give some time to answer questions, and provide examples to those who have shared needs to have more detailed evidence and possible feasibility analysis of migrating some modules.
   
   Specifically, we would recommend us to follow an existing practice in projects like Hadoop, to empower related development in a branch. ASF mechanism allows any committer to create a branch in the apache repo and do collaborative development there at their own pace. Per our existing process, merging a branch into main still requires lazy consensus. Branch development offers flexibility while accepting the risk of blocking when merging to the main. As a result, there are general incentives to keep alignment with the majority of the community and continued engagement to get buy-in. Branch development offers a way to collaborate on a possible but not definitive future of the project, as a branch can come with different outcomes such as being partially merged, continued development, or abandoned. Enabling different perspectives is important for us both as a project and community. 
   
   The TVM PMC re-affirmed that branch development can be used as an option for the project and for specific development around tvm unity. We would like to offer it as a possible option for the community and a first step of execution, with the goal of getting related pieces into main. I wrote down a more detailed post, on which we would love to get everyone’s feedback. Of course, this is only one possible option, and community members can freely choose their ways of participation.
   
   Developing in a branch will also give some time buffer to answer G1. It is valuable to answer questions and have grounded conversations to give more information to the members who are not yet on board with the new module. Notably, to many community members, detailed code examples, benchmarks, and continued engagement are necessary to get broader community buy-in. We would recommend having focused discussions on the questions of interest (e.g. giving concrete code tutorials for BYOC) to help the community members who have related questions. We encourage such continued conversations in forum threads, meetups, and development interactions with the goal of getting as much information as possible. Again, such interactions aim at demonstrating possibilities, but do not warrant deprecation or migration, since that choice should still lie in the hands of the community. Hopefully, they give a more comprehensive picture for us to make follow-up decisions collectively. 
   
   As part of winter break, I started to do more coding, and I was really fascinated to see that passion is still deep in my heart (and, I believe, in many of us) after so many years, thanks to this community and our common vision. As a community member, I am really motivated to spend focused energy helping to build concrete code examples and tutorial materials for G1.
   
   Please also checkout [this post](https://discuss.tvm.apache.org/t/establish-tvm-unity-branch/14244) for more details
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950322586


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator (rather than a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the AST node above.
+
+### ****call_dps_packed****
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) while still allowing the compiler to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that may not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. a PRNG handle or cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is translated directly into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land back to the shape-aware land using match_shape. Operating without shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
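+
+As a toy illustration (using plain Python data structures rather than actual TVM passes) of why purity inside a dataflow block makes such rewrites safe, consider dead-code elimination: since every binding is side-effect free, any binding whose result is not reachable from the block outputs can simply be dropped:
+
+```python
+def dead_code_eliminate(bindings, outputs):
+    # bindings: list of (var, op, args) in program order; args reference earlier vars
+    live = set(outputs)
+    kept = []
+    for var, op, args in reversed(bindings):
+        if var in live:
+            kept.append((var, op, args))
+            live.update(args)
+    return list(reversed(kept))
+
+block = [
+    ("lv0", "call_tir:linear", ["x", "w", "b"]),
+    ("lv1", "call_tir:relu",   ["lv0"]),
+    ("lv2", "call_tir:linear", ["x", "w", "b"]),   # never used downstream
+]
+print(dead_code_eliminate(block, outputs=["lv1"]))  # keeps lv0 and lv1, drops lv2
+```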
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented as an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, and the TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, as shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks, `SeqExpr`, which usually serves as a `Function`'s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope such as an *if* branch). Note that TIR variables (bound by `MatchShape`) follow the same scoping rules as normal `Var`s.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
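+
+For illustration, the same function can be spelled out against the Python-style class definitions in Section 4.1. This is only a rough sketch to show how the pieces nest; the constructor arguments and operator references below are simplified assumptions, not the exact C++/Python API:
+
+```python
+# Rough AST sketch of relax_func above, using the class names from Section 4.1.
+x   = Var("x",  type_annotation=DynTensorType(ndim=2, dtype="float32"))
+w   = Var("w",  type_annotation=DynTensorType(ndim=2, dtype="float32"))
+lv0 = DataflowVar("lv0", DynTensorType(2, "float32"))
+gv0 = Var("gv0", DynTensorType(1, "float32"))
+gv1 = Var("gv1", DynTensorType(1, "float32"))
+
+df_block = DataflowBlock(bindings=[
+    VarBinding(lv0, Call(op="relax.dot", args=[x, w])),       # lv0 = R.dot(x, w)
+    VarBinding(gv0, Call(op="relax.flatten", args=[lv0])),    # gv0 = R.flatten(lv0)
+])
+effect_block = BindingBlock(bindings=[
+    VarBinding(gv1, Call(op=ExternFunc("custom_inplace_update"), args=[gv0])),
+])
+relax_func = Function(params=[x, w],
+                      body=SeqExpr(blocks=[df_block, effect_block], body=gv1),
+                      ret_type=DynTensorType(1, "float32"))
+```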
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call** instruction (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and the runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions, invoked with the **Call** instruction to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap (a toy emulation of this mechanism is sketched below). Please refer to the shape lowering pass in the next subsection for more details.
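+
+A toy NumPy emulation of the shape-heap idea (the helpers below are stand-ins for the VM builtins, not the actual runtime functions):
+
+```python
+import numpy as np
+
+# Toy stand-ins for vm.builtin.store_shape / load_shape over an int64 "heap".
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(shape, heap, *indices):
+    for idx, dim in zip(indices, shape):
+        heap[idx] = dim
+
+def load_shape(heap, *indices):
+    return tuple(int(heap[i]) for i in indices)
+
+store_shape((3, 5), shape_heap, 0, 1)   # A's runtime shape (m, n) = (3, 5)
+# The shape lowering pass would generate a TIR function doing the equivalent of:
+shape_heap[2] = shape_heap[0] + 1       # j = m + 1
+shape_heap[3] = shape_heap[1] + 1       # k = n + 1
+print(load_shape(shape_heap, 2, 3))     # -> (4, 6)
+```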
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimal build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new optimization spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation, so passes compose freely (see the sketch after this list).
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
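+
+A minimal sketch of such a composable pipeline, applied to an `IRModule` like the `MyIRModule` below, is shown here. The pass names are placeholders standing in for the four passes of the minimum build (see Section 4.5); the exact names will be settled in the upstreamed code:
+
+```python
+import tvm
+from tvm import relax
+
+# Hypothetical pass names; each pass is an IRModule -> IRModule transformation.
+seq = tvm.transform.Sequential([
+    relax.transform.ToNonDataflow(),
+    relax.transform.CallTIRRewrite(),
+    relax.transform.VMMemoryLower(),
+    relax.transform.VMShapeLower(),
+])
+lowered_mod = seq(MyIRModule)          # still an IRModule after every pass
+exec = relax.vm.build(lowered_mod, tvm.target.Target("llvm"))
+```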
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered, because the Relax runtime only supports calling a packed function directly → we need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → we need to use a heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime (a toy emulation follows the list):
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
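+
+To make the storage/tensor split concrete, here is a toy NumPy emulation (these helpers are illustrations only, not the real runtime builtins): a storage is a flat byte buffer, and a tensor is a typed view into it at a given offset.
+
+```python
+import numpy as np
+
+def alloc_storage(size_bytes, device="cpu"):
+    # a storage is just a contiguous block of bytes
+    return np.empty(size_bytes, dtype="uint8")
+
+def alloc_tensor(storage, shape, offset, dtype):
+    # a tensor is a typed, shaped view into a storage at a byte offset
+    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
+    return storage[offset:offset + nbytes].view(dtype).reshape(shape)
+
+storage0 = alloc_storage(2 * 3 * 4)                      # room for 6 float32s
+tensor0 = alloc_tensor(storage0, (2, 3), 0, "float32")   # view over storage0
+tensor0[:] = 1.0                                          # writes land in storage0
+```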
+
+The program after `call_tir` lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocations have side effects, so they are now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+The program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = R.call_packed("vm.builtin.alloc_shape_heap", size=k)
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc
+    # when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
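+
+For reference, the generated shape function is an ordinary TIR PrimFunc that reads and writes heap slots. A hedged sketch (the exact TVMScript spelling may differ) of a shape function computing `n*m` from slots 0 and 1 into slot 2 could look like:
+
+```python
+from tvm.script import tir as T
+
+@T.prim_func
+def shape_func(heap: T.Buffer[(4,), "int64"]) -> None:
+    # slots 0 and 1 hold n and m; write n*m into slot 2
+    heap[2] = heap[0] * heap[1]
+```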
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also treats symbolic shapes as first class (D1 in Section 3), it can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule, which is shown in TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates a `te.placeholder` for each input Relax variable (e.g. input and weight).
+- Calls the TE/TOPI function (`topi.matmul` in this case) with those `te.placeholder`s.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc (a standalone illustration of this step follows the list).
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
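+
+As a small, self-contained illustration of the `te.create_prim_func` step, the snippet below reuses the symbolic-shape TE compute shown earlier; the wiring of the result into a `call_tir` node, which the BlockBuilder performs, is omitted here:
+
+```python
+import tvm
+from tvm import te
+
+n = te.var("n")
+A = te.placeholder((n,), name="a")
+B = te.placeholder((n,), name="b")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="c")
+
+# Convert the TE compute into a TIR PrimFunc in destination-passing style.
+prim_func = te.create_prim_func([A, B, C])
+print(prim_func.script())   # inspect the generated TIR
+```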
+
+Bridging Relax and TIR is simple and clean, given that Relax has first-class symbolic shapes and supports `call_tir` for cross-layer interaction.
+
+**Relay → Relax translator**
+
+To immediately boost model coverage and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` nodes calling into these generated TIR PrimFuncs.
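+
+A hypothetical usage sketch is shown below; the translator entry point name (`relay_translator.from_relay`) and its import path are assumptions for illustration only, and the actual API will be defined by the upstreamed code:
+
+```python
+import tvm
+from tvm import relay
+# from tvm.relax.testing import relay_translator  # hypothetical import path
+
+x = relay.var("x", shape=(1, 784), dtype="float32")
+w = relay.var("w", shape=(128, 784), dtype="float32")
+relay_mod = tvm.IRModule.from_expr(relay.nn.dense(x, w))
+
+# Translate the Relay module into a Relax IRModule containing a `main`
+# function plus the generated TIR PrimFuncs.
+relax_mod = relay_translator.from_relay(relay_mod["main"])  # hypothetical API
+relax_mod.show()
+```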
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution
+
+# 5. **Future work**
+
+This RFC only focuses on the foundational part of Relax. After it lands, we will incrementally incorporate additional capabilities and features. Relax aims to achieve parity with the functionality provided by Relay: this means that workloads which are functional on Relay will also be functional on Relax, even though the infrastructure underneath may change.
+
+Plans that we will bring in follow-up RFCs:
+
+- AOT: AOT compilation has a wide range of benefits, such as being more space efficient, and it is necessary for resource-constrained projects like uTVM. We are committed to continuously supporting AOT compilation in Relax, and there is an ongoing effort to connect Relax to the current AOT executor.
+- BYOC: We will try to reuse the existing translation spec. In Relax, BYOC can be formalized as a pass that generates calls to external packed functions.

Review Comment:
   From the BYOC integration perspective, users will need to bring the codegen that converts Relax ops to the BYOC representation (e.g., JSON) while reusing the runtime. We have a demo with TensorRT: https://github.com/tlc-pack/relax/tree/relax/src/relax/backend/contrib
   
   Relax would provide more unified and organized support for BYOC. In Relay, although the functionality is provided, it has been quite fragmented (IMHO), so it has been tricky to use multiple BYOCs together with TVM's pipeline. So, in Relax, our goal is to
   (1) Support the existing functionalities
   (2) Organize and simplify BYOC workflow
   
   @YuchenJin can provide the timeline for this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950294616


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+

Review Comment:
   Thanks for the comment! In addition to @slyubomirsky and @Hzfengsy's great points, I will share my thoughts from the perspective of the optimization and compilation pipeline.
   
   Although it might have been possible, the interaction between the graph IR and TensorIR/PackedFunc has been quite tricky in the Relay world. This has caused significant difficulties and non-trivial engineering effort, IMO. Here are some representative examples:
   
   - In Relay, there has been no convenient way to optimize the graph IR using feedback from the low level.
     - If TensorIR performs a layout transformation for a primfunc, its decision will affect other primfuncs as well. However, Relay cannot propagate such feedback back to the graph IR level, since the two different IRs cannot co-exist.
     - Graph-level tuning methods (e.g., TASO, Collage) need the capability to apply a set of passes to part of the graph, compile/measure its performance, and feed the performance number back to the graph IR level to generate better candidates. Although this could be achieved with nontrivial engineering effort, it would complicate the compilation pipeline and its maintenance. IMHO, joint optimization across multiple graph tuners (e.g., TASO+Collage) would be practically impossible.
   - Lowering has to be done all at once at the boundary between Relay and TensorIR, and customizing lowering has been very challenging (e.g., partial/custom lowering).
       - The main pipeline with `OpStrategy` has not been easy to customize when you want to lower part of the graph for your own target, such as BYOC, while keeping other parts in high-level IR. Therefore, people had to find ways to bypass it and apply their own lowering mechanism (e.g., `RelayToTIR`) outside the main pipeline.
       - If you only want to apply certain schedule rules to part of the graph IR, you need to lower those parts and apply the schedule rules to them. However, such freedom has not been allowed in the Relay main pipeline, so people had to find workarounds (e.g., use task extraction and find the primfunc among the extracted tasks; if extraction does not behave as users want, extra engineering effort is required).
   
   Since Relax unifies the abstractions, it can deliver those functionalities as compiler passes while providing flexibility and customizability. For example, since both high-level and low-level IRs co-exist, if TensorIR makes an optimization decision that may have a global effect, like a layout transformation, we can rewrite the graph-level IR accordingly to express that change and consider its global implications. Also, lowering can be implemented as a Relax IR -> TensorIR transformation pass. If you want to bring your own lowering mechanism, you can write a new pass. I expect you may be able to reuse most of the lowering machinery and only change the part about "how" you want to lower.
   
   I would be happy to discuss further if you are interested in this direction. :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950369010


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc calls that takes in an NDArray x and perform an unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for exmaple the above `R.unique(x)` can be lowered to `call_packed(”torch.unique”, x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free ****if: it only reads from its inputs and returns the result via its output, it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been `R.output`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most of the optimizations are graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level too. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build a **end-to-end MVP** (Minimum Viable Product) which allows the users to construct an end-to-end model (represented by IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef

Review Comment:
   Hi @tkonolige, it's a great question! `Expr->shape_` should be an `Expr`; `ObjectRef` is used here in its definition to prevent cyclic typing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950365718


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.output(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed before the call. We can address this by invoking a function that computes the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the three possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
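+For example (illustrative only), a value annotated as `R.Tensor[(n, 4), "float32"]` carries the type `DynTensorType(ndim=2, dtype="float32")`; the shape `(n, 4)` itself is stored separately in the `shape_` field described next.
+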
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.output(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case) and require heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
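+The following minimal sketch (TVMScript-style, illustrative only; `t0` and `s0` are assumed to already be in scope) isolates the two overloads:
+
+```python
+# Tensor overload: defines the symbolic variable m on its first occurrence
+# and returns a tensor whose shape field is refined to (m,)
+t1: R.Tensor[(m,), "float32"] = R.match_shape(t0, (m,))
+
+# Shape overload: matches an existing Shape value against the pattern
+# and returns a Shape; no Tensor is involved
+s1: R.Shape = R.match_shape(s0, (m,))
+```
+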
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: The rank of a tensor is not known (this can happen in rare cases involving loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handles, cuDNN contexts).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-free world back into the shape-aware world using match_shape. This shape-free path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
+
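+As a rough sketch (the exact lowering is up to the compiler, and `"torch.unique"` is only an illustrative registered PackedFunc name), the lowered flow could look like:
+
+```python
+# y carries RuntimeDepShape: the callee allocates and returns the result itself
+y = R.call_packed("torch.unique", x)
+
+# re-enter the shape-aware world before further shape-dependent optimization
+y1: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+```
+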
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow-block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
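+To make the implication for pass writers concrete, below is a hedged Python sketch of a rewrite that only touches bindings inside dataflow blocks and leaves effectful regions untouched. The names `DataflowBlock`, `blocks`, `bindings`, and `with_blocks` are assumptions for illustration, not the exact Relax API.
+
+```python
+def rewrite_dataflow_only(func, rewrite_binding):
+    """Apply rewrite_binding to bindings inside dataflow blocks only (sketch)."""
+    new_blocks = []
+    for block in func.blocks:                 # func.blocks is an assumed field
+        if isinstance(block, DataflowBlock):  # DataflowBlock is an assumed class
+            # safe region: pure and control-flow free, so bindings can be
+            # rewritten or reordered following the dataflow graph
+            new_blocks.append(DataflowBlock([rewrite_binding(b) for b in block.bindings]))
+        else:
+            # effectful region: left untouched unless a more careful analysis is done
+            new_blocks.append(block)
+    return func.with_blocks(new_blocks)       # with_blocks is an assumed helper
+```
+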
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.

Review Comment:
   Although Relax allows users to write their models directly (you can bind params as well: [example](https://github.com/tlc-pack/relax/blob/relax/tests/python/relax/test_transform_bind_params.py#L63)), the main path would be frontends for Torch, TF, ONNX, ... as we do in Relay.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950373108


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+

Review Comment:
   Sure. In short, TASO is a graph-level tuning method that generates different forms of an equivalent graph by trying out a set of rewriting rules (e.g., layout transformation, horizontal fusion, peephole optimization, ...). TVM provides some of those optimization heuristics, but they are not enabled by default, probably due to their potential side effects. TASO-like graph tuners can be helpful in this direction.
   TASO: https://cs.stanford.edu/~padon/taso-sosp19.pdf
   Another graph tuner, but with equality saturation technique: https://arxiv.org/pdf/2101.01332.pdf



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950884000


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,

Review Comment:
   The `output_shape` is implemented as a general `Expr` which can be `ShapeExpr` (single output case), or a `Tuple` (multi-output case). Yes, the shape and dtype are separated. 
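   
   To illustrate (a hedged sketch; `my_func`, `my_multi_out_func`, `x`, and `n` are assumed to be in scope, and the exact TVMScript spelling may differ from the final implementation):
   
   ```python
   # single output: output_shape is a ShapeExpr
   gv0 = R.call_tir(my_func, (x,), (n, 4), dtype="float32")
   
   # two outputs of the same dtype: output_shape is a Tuple of shapes
   gv1 = R.call_tir(my_multi_out_func, (x,), ((n, 4), (n,)), dtype="float32")
   ```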



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949615712


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc calls that takes in an NDArray x and perform an unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for exmaple the above `R.unique(x)` can be lowered to `call_packed(”torch.unique”, x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free ****if: it only reads from its inputs and returns the result via its output, it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+

Review Comment:
   Determining whether a given function is pure and control-flow-free would not be difficult to do in Relay; even marking portions of functions that are pure and control-flow-free would not be hard, but it might require a new AST construct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1220178200

   Hi @leandron, thanks for your feedback! :) 
   
   We share a common goal of minimizing disruption while incrementally improving TVM. One of the main questions is how to bring in the improvements. That’s indeed what we have carefully thought about.
   
   One thing we found in building the unity connection is that we need to **co-design things together**. Take first-class symbolic shape support as an example:
   
   Suppose we want to apply a `flatten` operation to flatten a tensor, here is the symbolic shape representation in Relax:
   
   ```python
   b: Tensor[(m, 224)]
   a: Tensor[(m * 224, )] = flatten(b)
   ```
   
   In Relay, we have `?` to denote an unknown dimension. So the above program in Relay is:
   
   ```python
   b: Tensor[(?, 224)]
   a: Tensor[(?, )] = flatten(b)
   ```
   
   Without symbolic shape, we lose the shape relation between tensor `a` and tensor `b`, which prevents good optimization opportunities, for example memory planning that reuses the memory between `a` and `b`, since we know at compile time that they occupy the same amount of memory.
   
   Supporting this requires introducing native symbolic shape support as opposed to a separate mechanism. It's worth noting that from the example above, the shape mechanism in Relax is very different from what we currently have in Relay.
   
   Additionally, the other improvements also benefit from first-class symbolic shape. For example, `call_tir` signature has `output_shape` which can be represented by symbolic shape (since TensorIR supports symbolic shape), and TE language is based on symbolic shape hence we can directly generate PrimFunc with `emit_te` (see [Relax-TE integration section](https://github.com/YuchenJin/tvm-rfcs/blob/relax-upstream-rfc/rfcs/0089-relax-upstreaming.md#44-relax-tetopi-integration)). Introducing each component separately will make the design less cohesive.
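   
   For instance (a hedged sketch in the RFC's TVMScript syntax, reusing the `flatten` example above; `flatten_prim_func` is an assumed PrimFunc name):
   
   ```python
   # the symbolic output shape (m * 224,) is carried directly by call_tir,
   # so the shape relation between a and b survives lowering
   a = R.call_tir(flatten_prim_func, (b,), (m * 224,), dtype="float32")
   ```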
   
   Directly introducing this symbolic shape support to the existing IR would mean a **one-stop transition** to the current Relax, which is not the incremental improvement we hope for.
   
   Relax can be viewed as complementary to Relay. Relay focuses on high-level op transformations, while the current Relax passes focus on TIR-graph co-transformations that can enable flexible fusion and layout rewrite, which is hard to achieve in Relay.
   
   As of this RFC, we do not seek to change the default build pipeline or replace Relay. In this RFC, we only introduce Relax as an optional component for those community members who need it. It is a common practice in other ML frameworks, for example, PyTorch brought in TorchFX as an optional (vertical) component to support graph exporting, while maintaining TorchScript. We totally agree that TVM default flow evolution is important, and we should carefully discuss that with the community in future RFCs. [Evolving default build](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#evolving-default-build-14) has been briefly discussed in [Establish TVM Unity Connection — A Technical Strategy](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344), and there will be an upcoming [tvm.compile RFC](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344#relation-to-upcoming-technical-rfcs-2) to discuss the long-term strategy to consolidate default build flows in TVM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] leandron commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
leandron commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r948925401


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   Can you please shed some light on where these community needs were raised before? Is there any GitHub issue or discuss post complaining about the specific solutions provided by the proposed RFC?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   Can you clarify what `performance` means in this context, specifically?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+Many low-level library functions (e.g. TensorRT) are designed in DPS. To be able to call into such a DPS packed function, and hence let the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case) and heavier machinery is required.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+

Review Comment:
   Thanks for providing the context on dataflow graphs, but can you also explain what work was done to compare the cost of introducing this as a new feature in a new IR, when compared to adding it to Relay? What are the fundamental issues?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+Many low-level library functions (e.g. TensorRT) are designed in DPS. To be able to call into such a DPS packed function, and hence let the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case) and heavier machinery is required.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    The shape of a tensor sometimes cannot be deduced statically, either
+    because the shape is truly data dependent (such as the output of the
+    `unique` operator) or because of limited shape inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example, `relax_func` contains a `SeqExpr`, the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:

Review Comment:
   Am I correct to understand that this introduces a 4th built-in runtime in TVM, alongside the Graph Runtime, AoT runtime and Relay VM? And is using this proposed new runtime the only way to run Relax programs as-is?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
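+
+For illustration, the difference can be seen in a minimal NumPy-based sketch (plain Python, not tied to any TVM API): a non-DPS function allocates and returns its own output, while a DPS function writes into an output buffer pre-allocated by the caller.
+
+```python
+import numpy as np
+
+def add_non_dps(a, b):
+    # non-DPS: the callee allocates and returns the output
+    return a + b
+
+def add_dps(a, b, out):
+    # DPS: the caller owns the output buffer; the callee only writes into it
+    np.add(a, b, out=out)
+
+a, b = np.ones(4, "float32"), np.ones(4, "float32")
+out = np.empty(4, "float32")  # allocated by the caller (e.g. the compiler's memory planner)
+add_dps(a, b, out)
+```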
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) is only syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) while letting the compiler directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        # C[m, k] = A[m, n] @ B[n, k]: spatial axes over (m, k), reduction axis over n
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                vi, vj, vk = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[vi, vj] = 0.0
+                C[vi, vj] += A[vi, vk] * B[vk, vj]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where types depend on values (shapes in our case), which requires heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variables if they occur in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but with its shape field updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
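+
+The two overloads can be illustrated with a short TVMScript-style sketch (hedged: the function signature and the symbolic variables `m`/`k` are illustrative, following the conventions of the earlier examples):
+
+```python
+@R.function
+def match_shape_example(x: R.Tensor[_, "float32"], s: R.Shape):
+    # Overload 1: value is a Tensor. This defines the symbolic variable m
+    # and returns a tensor whose shape field is refined to (m,).
+    y: R.Tensor[(m,), "float32"] = R.match_shape(x, (m,))
+
+    # Overload 2: value is a Shape. This matches the pattern directly
+    # and returns a Shape.
+    t: R.Shape = R.match_shape(s, (k,))
+    return y
+```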
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to be able to cover the general cases that may not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return a TVM Object. We can bring tensors from the no-shape-computation land back to the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfaces with external libraries that have weaker shape information.
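+
+For example, the data-dependent case above could be lowered roughly as in the sketch below (hedged: the packed function name and the exact lowered form are illustrative, not the output of a specific pass):
+
+```python
+# y's shape is unknown at compile time (RuntimeDepShape)
+y: R.Tensor[_, "float32"] = R.unique(x)
+
+# lowered to a non-DPS packed call that allocates and returns the result
+y = R.call_packed("torch.unique", x)
+
+# re-enter the shape-aware land: define m and refine y's shape to (m,)
+y1: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+```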
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most of the optimizations to be graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to operate at the dataflow block level. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A binding that represents matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each of its binding blocks is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A’s shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A’s shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap (see the sketch after this list). Please refer to the shape lowering pass in the next subsection for more details.
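+
+The sketch below illustrates this shape-heap bookkeeping, using plain NumPy and hand-written helpers as stand-ins for the VM builtins and the generated shape PrimFunc (the helper names are assumptions for illustration only):
+
+```python
+import numpy as np
+
+# the shape heap is just an integer NDArray owned by the VM
+shape_heap = np.zeros(4, dtype="int64")
+
+def store_shape(shape, heap, *indices):
+    # stand-in for the VM builtin that stores a shape at given heap indices
+    for idx, dim in zip(indices, shape):
+        heap[idx] = dim
+
+def shape_func(heap):
+    # stand-in for the TIR PrimFunc generated by the shape lowering pass
+    heap[2] = heap[0] + 1  # j = m + 1
+    heap[3] = heap[1] + 1  # k = n + 1
+
+A_shape = (5, 7)                      # runtime shape of tensor A: (m, n)
+store_shape(A_shape, shape_heap, 0, 1)
+shape_func(shape_heap)
+j, k = shape_heap[2], shape_heap[3]   # (6, 8)
+```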
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore the new design space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining part.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("int32"), T.var("int32")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.grid(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # memory allocation has side effects, so these bindings now live in a
+    # BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
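+
+Continuing the running example, a shape-lowered program could look roughly like the sketch below (hedged: this is illustrative rather than the exact pass output; the heap size, heap indices, and the generated shape function name `shape_func0` are assumptions):
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.runtime.builtin.alloc_heap(size=4)
+
+    # store x's runtime shape (n, m) at heap indices 0 and 1
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    # a generated TIR shape function computes heap[2] = heap[0] * heap[1]
+    R.call_packed("shape_func0", shape_heap)
+
+    # materialize concrete shapes from the heap
+    sh0 = relax.runtime.builtin.load_shape(shape_heap, 0, 1)  # (n, m)
+    sh1 = relax.runtime.builtin.load_shape(shape_heap, 2)     # (m*n,)
+
+    # combined with the C0 lowering above, the loaded shapes feed the
+    # storage/tensor allocations and the lowered kernel calls
+    storage0 = relax.runtime.builtin.alloc_storage(size=sh1, device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=sh0, offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=sh1, device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=sh1, offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+    return tensor1
+```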

Review Comment:
   All these runtime features seem pretty much runtime-independent, meaning they could be implemented in the existing runtimes. Taking the Relay VM as an example, why couldn't this be implemented as a feature in the existing runtime?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.

Review Comment:
   As a project, I think it creates some challenges around how to deal with new features being proposed and introduced in **Relay** by the established set of contributors who were not involved in Relax development.
   
   While it is clear in the RFC that Relay and Relax would share some infrastructure, what I'm taking from this RFC is that it will only set up the core components for Relax, without feature parity with Relay at this point, nor support in other components such as all runtimes.
   
   How does the proposed plan intend to deal with such new features being added in Relay, while the intention here is to - at some point - replace Relay with Relax?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+

Review Comment:
   Can you clarify why having "a graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc" is something we can't achieve by adding this feature to the existing Relay?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.

Review Comment:
   Shape deduction seems like a feature which is independent of IR. Why can't this be supported as a feature in Relay?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    tir_primfunc(*inputs, out_tensor)  # invoke the TIR PrimFunc in DPS
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a `GlobalVar` that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and dtype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor: it can instead create a memory plan for the intermediate tensors and tie allocations together for effective reuse.
+
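+As a hypothetical illustration of that last point (the names `alloc_tensor`, `f1`, `f2`, and `f3` below are placeholders rather than the actual lowered IR), consider three chained `call_tir` calls:
+
+```python
+# Naive lowering: every call_tir output gets its own fresh allocation.
+t0 = alloc_tensor((n, 4), "float32"); f1(x, t0)
+t1 = alloc_tensor((n, 4), "float32"); f2(t0, t1)
+t2 = alloc_tensor((n, 4), "float32"); f3(t1, t2)
+
+# With a memory plan, the compiler may let t2 reuse t0's storage, since t0
+# is no longer live once f2 has consumed it; only two buffers are needed.
+```
+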
+`call_tir` is implemented as a special Relax operator (rather than a standalone IR node) to minimize changes to the IR. From the AST's point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### call_packed
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### call_dps_packed
+
+Many low-level library functions (e.g. in TensorRT) are designed in DPS. To call into such a DPS packed function while still letting the compiler directly manage the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.output(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: Shape deduction as first-class computation
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the three possible forms outlined below.
+
+**checked_type_**
+
+`Expr.checked_type_` stores the compile-time-deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.output(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensorType(ndim=2, dtype="float32")`, and the shape is a special value field attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter the land of [dependent typing](https://en.wikipedia.org/wiki/Dependent_type), where types depend on values (shapes in our case), which requires heavier machinery to handle.
+
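+To make the separation concrete, the pseudocode below spells out what is attached to `lv0` after deduction, using the field and class names from Section 4.1 (this annotates IR internals and is not executable user code):
+
+```python
+# lv0 = R.reshape(x, (n, 4)) from the program above
+lv0.checked_type_   # DynTensorType(ndim=2, dtype="float32") -- the type: only rank and dtype
+lv0.shape_          # ShapeExpr([n, 4])                      -- a value field, not part of the type
+```
+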
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shapes cover most of the use cases, we inevitably also need to handle general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray `x` and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back into the shape-aware land using `match_shape`, as sketched below. Running without shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfaces with external libraries that provide weaker shape information.
+
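+A minimal sketch of how this looks at the Relax level (the lowering shown in the comment and the exact spelling of the calls are illustrative):
+
+```python
+# Data-dependent call: the result's shape is unknown at compile time
+# (RuntimeDepShape), so the lowered callee allocates and returns its own
+# output, e.g. y = R.call_packed("torch.unique", x).
+y: R.Tensor[_, "float32"] = R.unique(x)
+
+# match_shape brings y back into the shape-aware land: it binds the symbolic
+# dimension m at runtime, so later bindings can rely on it.
+y1: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+gv0: R.Tensor[(m,), "float32"] = R.exp(y1)
+```
+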
+## D2: Dataflow block as a first-class construct
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, without changing other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under a dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example, in-place weight updates during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been passed to `R.output()`.
+
+In the Relax function above, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow blocks. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
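+To illustrate this pass-writing model (this is a toy in plain Python, not the actual Relax pass infrastructure), the self-contained sketch below folds `relu(relu(x))` into `relu(x)` only for bindings inside `DataflowBlock`s, leaving possibly effectful `BindingBlock`s untouched:
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+@dataclass
+class Binding:
+    var: str
+    op: str
+    args: List[str]
+
+class BindingBlock:
+    def __init__(self, bindings: List[Binding]):
+        self.bindings = bindings
+
+class DataflowBlock(BindingBlock):
+    pass
+
+def fold_double_relu(blocks: List[BindingBlock]) -> None:
+    """Rewrite relu(relu(x)) -> relu(x), but only inside DataflowBlocks."""
+    for block in blocks:
+        if not isinstance(block, DataflowBlock):
+            continue  # bindings here may have side effects; leave them alone
+        producer = {b.var: b for b in block.bindings}
+        for b in block.bindings:
+            if b.op == "relu":
+                src = producer.get(b.args[0])
+                if src is not None and src.op == "relu":
+                    # Read the inner relu's input directly; the inner
+                    # binding becomes dead and can be removed later.
+                    b.args[0] = src.args[0]
+
+blocks = [
+    DataflowBlock([
+        Binding("lv0", "relu", ["x"]),
+        Binding("gv0", "relu", ["lv0"]),
+    ]),
+    BindingBlock([
+        Binding("gv1", "call_packed", ["custom_inplace_update", "gv0"]),
+    ]),
+]
+fold_double_relu(blocks)
+print(blocks[0].bindings[1])  # Binding(var='gv0', op='relu', args=['x'])
+```
+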
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) that allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, and TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, as shown on the right side of the diagram: the **Relax AST**, the **Relax runtime**, and the **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. At the same time, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes the shape of a tensor cannot be deduced statically,
+    either because it is truly data-dependent (such as the output of
+    the `unique` operator) or because of limited shape inference
+    capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, which in turn contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.output(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility in supporting dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions, invoked with the **Call** instruction to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) via a call to the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap; see the sketch after this list. Please refer to the shape lowering pass in the next subsection for more details.
+
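+The pseudocode below spells out that example; the builtin names (`store_shape`, `load_shape`) and the generated `shape_func` follow the description above and are illustrative rather than the exact generated code.
+
+```python
+# Computing (j, k) = (m + 1, n + 1) at runtime via the shape heap.
+# shape_heap is an internal NDArray owned by the VM.
+
+# 1. Store A's runtime shape (m, n) into heap slots 0 and 1.
+store_shape(A.shape, shape_heap, indices=[0, 1])
+
+# 2. A TIR PrimFunc generated by the shape lowering pass reads slots 0/1 and
+#    writes the results into slots 2/3:
+#      shape_heap[2] = shape_heap[0] + 1   # j = m + 1
+#      shape_heap[3] = shape_heap[1] + 1   # k = n + 1
+shape_func(shape_heap)
+
+# 3. Later instructions (e.g. allocating a call_tir output) read (j, k) back
+#    from slots 2 and 3.
+j, k = load_shape(shape_heap, indices=[2, 3])
+```
+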
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   Is there an expected timeline by which you plan to catch up on this support for the other runtimes? My worry is that it might create a disparity with regards to `[features] * [IR] * [runtime]` that is hard to track and explain to users.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.

Review Comment:
   Can you clarify on why we couldn't add these are new features to Relay? It is not clear in the text what analysis was done in this particular regard.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program from Relay to TIR in one shot as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, the shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the three possible forms outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where types depend on values (shapes in our case), which requires heavier machinery to handle. 
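+
+To make this concrete, here is a small illustration (TVMScript-style, not meant to be run on its own) of two expressions that share the same type but carry different `shape_` fields:
+
+```python
+# Both a and b have the same checked_type_: DynTensor(ndim=2, dtype="float32").
+# Only the value-level shape_ field differs:
+a: R.Tensor[(4, 4), "float32"]   # a.shape_ is ShapeExpr([4, 4])  (static)
+b: R.Tensor[(n, m), "float32"]   # b.shape_ is ShapeExpr([n, m])  (symbolic)
+```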
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where the output shape is data-dependent on the input (e.g. the `unique` operator).
+- C1: The rank of a tensor is not known (this can happen in rare cases of loops).
+- C2: The dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. a PRNG handle or a cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware world back to the shape-aware world using `match_shape`. Opting out of shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that provide weaker shape information.
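+
+As a short sketch of this flow (TVMScript-style, mirroring the constructs above; `some_primfunc` is a hypothetical PrimFunc name):
+
+```python
+y: R.Tensor[_, "float32"] = R.unique(x)                      # RuntimeDepShape: shape unknown until y is produced
+z: R.Tensor[(k,), "float32"] = R.match_shape(y, (k,))        # defines symbolic k; z is shape-aware again
+w = R.call_tir(some_primfunc, (z,), (k,), dtype="float32")   # k can now drive DPS-style allocation
+```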
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside it is **side-effect free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under a dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight updates during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow blocks. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to work at the dataflow block level as well. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, and the TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side of the diagram: the **Relax AST**, the **Relax runtime**, and the **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes the shape of a tensor cannot be deduced statically,
+    either because the shape is truly data-dependent (such as the
+    output of the `unique` operator) or because of limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure, `SeqExpr`, to handle binding blocks; a `SeqExpr` usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope such as an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as a normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, and the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
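+
+For reference, the AST that this program corresponds to looks roughly as follows. This is a pseudo-code sketch using the node classes above; the variable objects, op references, and constructor arguments are shown symbolically and do not reflect the exact constructor signatures:
+
+```python
+Function(
+    params=[x, w],
+    body=SeqExpr(
+        blocks=[
+            DataflowBlock(bindings=[
+                VarBinding(var=lv0, value=Call(op=Op::Get("relax.dot"), args=[x, w])),      # lv0 is a DataflowVar
+                VarBinding(var=gv0, value=Call(op=Op::Get("relax.flatten"), args=[lv0])),   # gv0 escapes the block
+            ]),
+            BindingBlock(bindings=[
+                VarBinding(var=gv1, value=Call(op=ExternFunc("custom_inplace_update"), args=[gv0])),
+            ]),
+        ],
+        body=gv1,
+    ),
+    ret_type=None,  # elided in this sketch
+)
+```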
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call instruction** (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and the runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions invoked with the **Call instruction** to get the shape of a tensor).
+- Shape calculations are done via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at indices 2 and 3 of the shape heap, as sketched below. Please refer to the shape lowering pass in the next subsection for more details.
+
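+A rough sketch of what the VM effectively executes for the `(j, k) = (m + 1, n + 1)` example above (the builtin names follow this RFC; the exact calling convention is an assumption, and `shape_add_one` is a hypothetical name for the generated shape PrimFunc):
+
+```python
+heap = vm.builtin.alloc_shape_heap(4)          # NDArray with 4 int64 slots
+vm.builtin.store_shape(A.shape, heap, 0, 1)    # heap[0] = m, heap[1] = n
+shape_add_one(heap)                            # generated TIR PrimFunc: heap[2] = heap[0] + 1, heap[3] = heap[1] + 1
+jk = vm.builtin.load_shape(heap, 2, 3)         # construct the shape (j, k)
+```
+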
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   Is the "json" Graph Runtime is missing from this list? Isn't this runtime expected to be supported?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.

Review Comment:
   Can you make this paragraph a bit more specific and list the critical needs raised by the community that are addressed by Relax?
   
   Reading the attached link, it does not clearly describe those needs, as it is also a description of features and milestones in a broader context than Relax.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. We first introduce what the user-facing interface will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users will be able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import numpy as np
+
+import tvm
+import tvm.script
+from tvm import relax
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+data = tvm.nd.array(np.random.rand(2, 3).astype(np.float32))
+weight = tvm.nd.array(np.random.rand(3, 4).astype(np.float32))
+res = vm["relax_func"](data, weight)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
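+
+The caller side of a DPS call therefore looks roughly like this (a sketch with illustrative names, complementing the signature above):
+
+```python
+# The caller (e.g. the compiler-generated code) owns the output buffer:
+out = alloc_tensor((n, m), "float32")   # allocate the destination first
+low_level_func(input0, input1, out)     # the callee writes its result into `out`
+```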
+
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile a Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling packed functions directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- No shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effects, so it now lives in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    relax.runtime.builtin.free_tensor(tensor0)
+    relax.runtime.builtin.free_storage(storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k)
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    relax.runtime.builtin.free_tensor(gv0)
+    relax.runtime.builtin.free_storage(storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
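+
+Continuing the snippet above, such a TE computation can be turned into a TIR PrimFunc with symbolic shapes via `te.create_prim_func` (the same utility that `emit_te` calls into, as described below). This is a sketch only; minor details may differ:
+
+```python
+# Create a DPS TIR PrimFunc from the TE tensors; the dimension n stays symbolic.
+prim_func = te.create_prim_func([A, B, C])
+# prim_func can then be embedded into a Relax program and invoked via call_tir.
+```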
+
+Since Relax also has symbolic shapes as first class (D1 in Section 3), Relax can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the above diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule shown in TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Schedules the TE/TOPI function (`topi.matmul` in this case) using those `te.placeholder`.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean given that Relax treats symbolic shapes as first-class and supports `call_tir` for cross-layer interactions.
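+
+To make this concrete, below is a minimal, self-contained sketch of the TE → TIR step that `emit_te` relies on under the hood: it reuses the symbolic-shape TE computation from the snippet above and converts it into a TIR PrimFunc via `te.create_prim_func`. The `BlockBuilder` wiring and the final `call_tir` emission are intentionally omitted here.
+
+```python
+from tvm import te
+
+# Symbolic-shape TE computation, identical to the earlier example.
+n = te.var(name="n")
+A = te.placeholder((n,), name="a")
+B = te.placeholder((n,), name="b")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="c")
+
+# The TE -> TIR step used by emit_te: turn the TE computation into a
+# TIR PrimFunc that call_tir can reference from the Relax side.
+prim_func = te.create_prim_func([A, B, C])
+print(prim_func.script())
+```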
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` calls into these generated TIR PrimFuncs.
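+
+As a usage sketch (the translator's exact module path and signature are assumptions here, since the translator itself is part of this upstreaming), the intended flow looks roughly like:
+
+```python
+from tvm.relay import testing
+
+# Obtain a Relay module from any existing frontend; the Relay testing models work as well.
+relay_mod, params = testing.resnet.get_workload(num_layers=18)
+
+# Hypothetical entry point of the Relay -> Relax translator described above: it lowers
+# Relay ops via OpStrategy/TOPI and emits a call_tir-based Relax main function.
+from tvm.relax.testing import relay_translator  # assumed location
+relax_mod = relay_translator.from_relay(relay_mod["main"])
+```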
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution
+
+# 5. **Future work**
+
+This RFC only focuses on the foundation part of Relax. After it lands, we will incrementally incorporate additional capabilities and features. Relax aims to achieve parity with the functionality provided by Relay: this means that workloads which are functional on Relay will also be functional on Relax, even though the infrastructure underneath may change.
+
+Plans that we will bring in future RFCs:

Review Comment:
   Currently all the framework frontends support Relay. What's the plan with regards to Frontends and Relax? I think it should be covered in this section.



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (rather than a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### **call_dps_packed**
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way), so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
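+
+For illustration, a small sketch, assuming `DynTensorType` is exposed at the Python level under `tvm.relax` with exactly the two fields above:
+
+```python
+from tvm import relax
+
+t0 = relax.DynTensorType(ndim=2, dtype="float32")   # rank-2 float32 tensor
+t1 = relax.DynTensorType(ndim=-1, dtype="float32")  # float32 tensor of unknown rank
+```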
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land, where types depend on values (shapes in our case), which requires heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, inevitably we also need to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for example the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land back to the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculation and interfacing with external libraries that have weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been explicitly output via `R.output()`.
+
+In the above Relax function, both `R.read_tensor_from_file` and `R.call_packed` have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could be converted to operate at the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
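+
+As a small illustration of how these structures can be traversed (a sketch that only assumes the Python bindings mirror the class fields listed above), the helper below walks the blocks of a function such as `relax_func` and counts how many bindings live in dataflow blocks versus ordinary binding blocks:
+
+```python
+from tvm import relax
+
+def count_bindings(func):
+    """Count pure (dataflow) vs. potentially effectful bindings in a Relax Function."""
+    pure, effectful = 0, 0
+    body = func.body  # typically a SeqExpr
+    for block in body.blocks:
+        if isinstance(block, relax.DataflowBlock):
+            pure += len(block.bindings)
+        else:
+            effectful += len(block.bindings)
+    return pure, effectful
+```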
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap (which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details, and see the sketch right after this list.
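+
+A hand-written sketch of the kind of shape-computation PrimFunc described in the bullet above (the heap layout, buffer size, and function name are assumptions for illustration only):
+
+```python
+from tvm.script import tir as T
+
+@T.prim_func
+def shape_func(heap: T.handle) -> None:
+    # assumed heap layout: H[0] = m, H[1] = n, H[2] = j, H[3] = k
+    H = T.match_buffer(heap, (4,), "int64")
+    H[2] = H[0] + T.int64(1)
+    H[3] = H[1] + T.int64(1)
+```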
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new designs.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation. (We can use `alloc_tensor` to achieve the same goal.)
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc
+    # when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = R.call_packed("load_shape", shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it can still be tedious to construct TIR functions by hand in TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shape. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also treats symbolic shapes as first-class (D1 in Section 3), it can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above diagram demonstrates how users can build an end-to-end workload by leveraging TOPI and TE. The left side uses the `relax.BlockBuilder` API to incrementally build an IRModule; the right side shows the resulting module printed in TVMScript.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Schedules the TE/TOPI function (`topi.matmul` in this case) using those `te.placeholder`.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean given that Relax treats symbolic shapes as first-class and supports `call_tir` for cross-layer interactions.
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` calls into these generated TIR PrimFuncs.
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution

Review Comment:
   Many on this list look like pretty big chunks of work, when looked in isolation.
   
   Is it possible to expand this section with further breakdown on the upstreaming strategy here, so that we can identify work that we might want to split out in, perhaps, separate RFCs so that we can have more visibility on the actual impact on the compiler stack create by each component?
   
   At least, I'd ask for more low level detailing on these components. Feel free to add more:
   - Relax IR
   - Relax VM
   - Relay → Relax translator
   - VM Codegen



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### **call_dps_packed**
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to handle general cases that do not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation for which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated into a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for example the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the shape-unaware land back to the shape-aware land using match_shape. Running without shape information is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
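+
+For illustration, a hypothetical lowering of the `R.unique` example above could look like the following sketch (the lowered form and the `torch.unique` dispatch are illustrative, not the exact output of the pass):
+
+```python
+# Illustrative only: the data-dependent result is produced by an opaque packed call,
+# then brought back into the shape-aware world with match_shape.
+y = R.call_packed("torch.unique", x)                     # Tensor with runtime-dependent shape
+y1: R.Tensor[(k,), "float32"] = R.match_shape(y, (k,))   # define symbolic k from y's runtime shape
+gv = R.exp(y1)                                           # downstream ops can use k again
+```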
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented as a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
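+
+As a plain-Python illustration of this definition (not a TVM API):
+
+```python
+counter = 0
+
+def pure_add(x, y):
+    # pure: reads its inputs and returns a result, touching nothing else
+    return x + y
+
+def impure_add(x, y):
+    # impure: mutates state outside the function (a side effect)
+    global counter
+    counter += 1
+    return x + y
+```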
+
+A **dataflow graph** means every operation inside it is **side-effect free** and there is no **control flow** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under a dataflow block are side-effect-free and do not contain control flow (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight updates during model training) and control flow. The program below is an example that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow blocks. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could also be converted to work at the dataflow block level. These optimizations can be written by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on the Relax runtime. Other components in the TVM stack such as TIR, TOPI, and TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks, `SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`; a normal `Var` bound in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr` (see the sketch below).
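+
+The evaluation order can be summarized with the following plain-Python sketch (the `eval_expr` helper and the dict-based environment are hypothetical stand-ins, not TVM APIs):
+
+```python
+def eval_seq_expr(seq_expr, env, eval_expr):
+    """Evaluate each BindingBlock in order, then the body; the body's value is the result."""
+    for block in seq_expr.blocks:
+        for binding in block.bindings:
+            # bind the variable to the value of its right-hand side
+            env[binding.var] = eval_expr(binding.value, env)
+    return eval_expr(seq_expr.body, env)
+```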
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, and the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - The **Call instruction** (packed function invocation) is the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - A builtin packed function library bridges the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions invoked with the **Call instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap (a sketch of such a generated PrimFunc is shown below). Please refer to the shape lowering pass in the next subsection for more details.
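+
+As a concrete illustration, the generated shape PrimFunc for the `(j, k) = (m+1, n+1)` example could look like the following sketch (the function name and the heap layout are illustrative assumptions):
+
+```python
+from tvm.script import tir as T
+
+@T.prim_func
+def shape_func(H: T.Buffer[(4,), "int64"]) -> None:
+    # H[0] = m and H[1] = n were written by store_shape;
+    # the results j = m + 1 and k = n + 1 go to H[2] and H[3].
+    H[2] = H[0] + T.int64(1)
+    H[3] = H[1] + T.int64(1)
+```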
+
+As a future plan, we will consolidate the Relay VM and the Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule to a runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow users to explore new design spaces.
+
+The Relax compilation flow is designed with the following goals:
+
+- Compile a Relax program to a format that the Relax runtime can directly execute.
+- Provide a compilation pipeline that enables composable transformations (see the sketch below):
+    - Every transformation is an `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be capable of optimizing the remaining part.
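+
+For example, a customized pipeline can be assembled from such `IRModule`-to-`IRModule` passes, as in the sketch below (the specific pass names are illustrative placeholders based on the prototype, not a committed API of this RFC):
+
+```python
+import tvm
+from tvm import relax
+
+# `mod` is any Relax IRModule (e.g. the MyIRModule example below).
+# Every step maps an IRModule to a new IRModule, so steps compose freely.
+seq = tvm.transform.Sequential(
+    [
+        relax.transform.FuseOps(),         # hypothetical graph-level optimization
+        relax.transform.CallTIRRewrite(),  # lower call_tir to explicit allocation (see C0 below)
+        relax.transform.VMShapeLower(),    # lower symbolic shapes to shape-heap ops (see C1 below)
+    ]
+)
+mod = seq(mod)
+```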
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(a: T.handle, b: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(a, (n1, m1))
+        Y = T.match_buffer(b, (n1, m1))
+        with T.block(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because the Relax runtime only supports calling a packed function directly → we need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → we need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- Has no shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computations (we could also use `alloc_tensor` to achieve the same goal).
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as a TIR PrimFunc
+    # when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp", x, gv0)
+
+    sh1 = R.call_packed("load_shape", shape_heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten", gv0, gv1)
+
+    R.call_packed("free_tensor", gv0)
+    R.call_packed("free_storage", storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support for directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering.
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
+
+Since Relax also has symbolic shapes as first-class citizens (D1 in Section 3), it can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The above code snippets demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the above diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule shown in TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates `te.placeholder` for the input Relax variables (e.g. input and weight)
+- Schedules the TE/TOPI function (`topi.matmul` in this case) using those `te.placeholder`.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean given that Relax has symbolic shapes as first-class citizens and supports `call_tir` for cross-layer interactions.
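+
+Since the diagram above is rendered as an image, here is a rough textual sketch of the `BlockBuilder`/`emit_te` flow it depicts. The constructor and method signatures (`relax.Var`, `relax.DynTensorType`, `BlockBuilder.function`, `emit_te`, `emit_output`, `emit_func_output`) follow this RFC's descriptions and the Relax prototype, and may differ in the final upstreamed API:
+
+```python
+import tvm
+from tvm import relax, te, topi
+
+n = te.var("n")
+type_anno = relax.DynTensorType(ndim=2, dtype="float32")
+x = relax.Var("x", [n, 128], type_anno)   # Relax variable with a symbolic first dimension
+w = relax.Var("w", [128, 64], type_anno)
+
+bb = relax.BlockBuilder()
+with bb.function("main", [x, w]):
+    with bb.dataflow():
+        # emit_te creates te.placeholders for x and w, applies topi.matmul to build the
+        # TE compute, converts it to a TIR PrimFunc, and emits a call_tir to it
+        lv0 = bb.emit_te(topi.matmul, x, w)
+        gv0 = bb.emit_output(lv0)
+    bb.emit_func_output(gv0)
+
+mod = bb.get()   # IRModule containing the Relax "main" plus the generated PrimFunc
+```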
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` calls into these generated TIR PrimFuncs.
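+
+A sketch of how the translator might be invoked is shown below (the module path `tvm.relax.testing.relay_translator` and the `from_relay` entry point follow the prototype and are assumptions here; the upstreamed location may differ):
+
+```python
+import tvm
+from tvm import relay
+
+# a small Relay function: dense(x, w)
+x = relay.var("x", shape=(1, 784), dtype="float32")
+w = relay.var("w", shape=(128, 784), dtype="float32")
+relay_func = relay.Function([x, w], relay.nn.dense(x, w))
+
+from tvm.relax.testing import relay_translator
+
+# returns an IRModule with the generated TIR PrimFuncs and a Relax "main"
+relax_mod = relay_translator.from_relay(relay_func)
+```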
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution
+
+# 5. **Future work**
+
+This RFC only focuses on the foundational part of Relax. After it lands, we will incrementally incorporate additional capabilities and features. Relax aims to achieve parity with the functionality provided by Relay: this means that workloads which are functional on Relay will also be functional on Relax, even though the infrastructure underneath may change.

Review Comment:
   Just for clarity: is the ambition for Relax to completely replace Relay at some point?
   
   If yes, when in terms of timeline would you see Relax being feature compatible with Relay?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users will be able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to a VM executable and run it on the Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        for i in T.grid(n):
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into the lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
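+
+As a plain-NumPy illustration of the convention (not a TVM API), the caller allocates the output buffer and the callee writes into it:
+
+```python
+import numpy as np
+
+def dps_add(a, b, out):
+    # the result is written into the pre-allocated `out` argument
+    np.add(a, b, out=out)
+
+a = np.ones(4, dtype="float32")
+b = np.ones(4, dtype="float32")
+out = np.empty(4, dtype="float32")   # allocated by the caller (the "framework")
+dps_add(a, b, out)
+```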
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    tir_primfunc(*inputs, out_tensor)  # invoke the DPS PrimFunc with the pre-allocated output
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special Relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar to represent the above AST node.
+
+### **call_dps_packed**
+
+To be able to call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed in this way), so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+import numpy as np
+
+import tvm
+from tvm.script import relax as R, tir as T
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b.copyfrom(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, k, n):
+            with T.block():
+                i, j, r = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, r] * B[r, j]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc calls that takes in an NDArray x and perform an unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for exmaple the above `R.unique(x)` can be lowered to `call_packed(”torch.unique”, x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free ****if: it only reads from its inputs and returns the result via its output, it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been `R.output`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most of the optimizations are graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level too. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build a **end-to-end MVP** (Minimum Viable Product) which allows the users to construct an end-to-end model (represented by IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: Each binding block in its `BindingBlock` is evaluated, and then the `body` expression is evaluated—the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example, `relax_func` contains a `SeqExpr`, the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.func
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For the ease of implementation and flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similiar to the Relay VM but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap(which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR Primfunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details.
+
+As future plan, we will consolidate Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allow the users to explore new space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be capable to optimize the left part.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(a: ty.handle, b: ty.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as i, j:
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### A**ddress C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+		
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, "float32")
+    R.call_packed("tirexp"), x, tensor0)
+		
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, "float32")
+    R.call_packed("flatten"), tensor0, tensor1)
+		
+    R.call_packed("free_tensor"), tensor0)
+    R.call_packed("free_storage"), storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### A**ddress C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation
+    
+    (We can use `alloc_tensor` to achieve the same goal)
+    
+- `relax.runtime.builtin.store_shape(shape, heap, idx0, ...)`: Store a shape into specific indices in the shape heap.
+- `relax.runtime.builtin.load_shape(heap, idx0, ...) -> shape`: Construct a shape from the shape heap according to the indices.
+
+Program after shape lowering:
+
+```python
+@R.function
+def relax_function(x):
+    shape_heap = relax.call_packed("vm.builtin.alloc_shape_heap", size=k) 
+    relax.runtime.builtin.store_shape(x.shape, shape_heap, 0, 1)
+    sh = relax.runtime.builtin.load_shape(shape_heap, 0, 1)
+    # this product_shape function (to compute n*m) is generated as TIR primfunc when visiting ShapeExpr in the shape lowering pass
+    shape_size = product_shape(sh) 
+		
+    storage0 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv0 = relax.runtime.builtin.alloc_tensor(storage0, sh, 0, "float32")
+    R.call_packed("tirexp"), x, gv0)
+		
+    sh1 = R.call_packed("load_shape"), heap, 0, 1)
+    storage1 = relax.runtime.builtin.alloc_storage(size=shape_size, device=cpu)
+    gv1 = relax.runtime.builtin.alloc_tensor(storage1, sh1, 0, "float32")
+    R.call_packed("flatten"), gv0, gv1)
+		
+    R.call_packed("free_tensor"), gv0)
+    R.call_packed("free_storage"), storage0)
+    return gv1
+```
+
+## 4.4 Relax-TE/TOPI integration
+
+Relax brings support of directly embedding TIR functions through `call_tir`. However, it is still hard to manually construct TIR functions through TVMScript. In Relax, we can reuse libraries such as TOPI (pre-defined TE functions) for quick workload creation and operator lowering. 
+
+The Relax-TE integration is unique to Relax because the TE language in TVM is also based on symbolic shapes. For example, the following code uses `te.var` to create symbolic dimension variables whose values can be specified during execution:
+
+```python
+n = te.var(name='n')
+A = te.placeholder((n,), name='a')
+B = te.placeholder((n,), name='b')
+C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
+```
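+
+For reference, the symbolic TE computation above can be lowered into a TIR PrimFunc with `te.create_prim_func`, which is the same mechanism that `emit_te` (described below) relies on. This is a small illustrative snippet, assuming a TVM build with TVMScript printing enabled, rather than part of the proposed Relax API surface:
+
+```python
+import tvm
+from tvm import te
+
+n = te.var(name="n")
+A = te.placeholder((n,), name="a")
+B = te.placeholder((n,), name="b")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="c")
+
+# Lower the TE computation into a TIR PrimFunc; the symbolic dimension `n`
+# stays symbolic in the generated function and is bound when the function is called.
+prim_func = te.create_prim_func([A, B, C])
+print(prim_func.script())  # inspect the generated TIR
+```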
+
+Since Relax also treats symbolic shapes as first-class (D1 in Section 3), it can directly integrate with the TE and TOPI libraries.
+
+![relax-emit-te](../resources/relax-emit-te.png)
+
+The code snippets above demonstrate how users can build an end-to-end workload by leveraging TOPI and TE. The left side of the diagram uses the `relax.BlockBuilder` API to incrementally build the IRModule shown as TVMScript on the right.
+
+The Relax BlockBuilder has a member function `emit_te` as highlighted in the program on the left. `emit_te` takes the following arguments:
+
+- a TE function
+- Relax variables that define the input tensors (for example the input and weight variables)
+
+`emit_te` then does the following:
+
+- Creates a `te.placeholder` for each input Relax variable (e.g. input and weight).
+- Applies the TE/TOPI function (`topi.matmul` in this case) to those `te.placeholder`s.
+- Calls into `te.create_prim_func` to create a TIR PrimFunc.
+- Generates a call into the generated TIR PrimFunc via `call_tir`.
+
+Bridging Relax and TIR is simple and clean, given that Relax treats symbolic shapes as first-class and supports `call_tir` for cross-layer interaction.
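+
+Since the diagram above is rendered as an image, the following is a minimal sketch of what its left-hand side roughly looks like. It assumes the `BlockBuilder` surface shown in the diagram (`function`, `dataflow`, `emit_te`, `emit_output`, `emit_func_output`, `get`); the exact names and signatures in the upstreamed code may differ.
+
+```python
+import tvm
+from tvm import relax, tir, topi
+
+# Symbolic dimensions; their concrete values are bound at runtime.
+n = tir.Var("n", "int64")
+m = tir.Var("m", "int64")
+
+bb = relax.BlockBuilder()
+x = relax.Var("x", [n, m], relax.DynTensorType(2, "float32"))
+w = relax.Var("w", [m, n], relax.DynTensorType(2, "float32"))
+
+with bb.function("main", [x, w]):
+    with bb.dataflow():
+        # emit_te creates te.placeholders for x and w, applies topi.matmul to them,
+        # lowers the result via te.create_prim_func, and emits a call_tir to the new PrimFunc.
+        lv0 = bb.emit_te(topi.matmul, x, w)
+        gv = bb.emit_output(lv0)
+    bb.emit_func_output(gv)
+
+mod = bb.get()  # IRModule containing the Relax "main" plus the generated TIR PrimFunc
+mod.show()
+```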
+
+**Relay → Relax translator**
+
+To immediately boost the coverage of models and leverage existing Relay optimizations, a Relay-to-Relax translator is implemented. The translator visits the Relay graph in post-order, lowers Relay ops to their TOPI functions using `OpStrategy`, and uses `emit_te` to generate the corresponding TIR PrimFuncs and a Relax `main` function that contains a sequence of `call_tir` calls into these generated TIR PrimFuncs.
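+
+As a usage illustration, invoking the translator is expected to look roughly like the snippet below. The module path and signature (`tvm.relax.testing.relay_translator.from_relay`, taking a Relay function and a target) are assumptions for illustration, not a committed API.
+
+```python
+import tvm
+from tvm import relay
+from tvm.relax.testing import relay_translator  # assumed location of the translator
+
+# A small Relay function; in practice this would come from a frontend importer.
+data = relay.var("data", shape=(1, 784), dtype="float32")
+weight = relay.var("weight", shape=(128, 784), dtype="float32")
+out = relay.nn.relu(relay.nn.dense(data, weight))
+relay_mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
+
+# Translate to Relax: Relay ops are lowered to TOPI via OpStrategy and emitted as
+# TIR PrimFuncs, called through call_tir from the generated Relax "main" function.
+target = tvm.target.Target("llvm")
+relax_mod = relay_translator.from_relay(relay_mod["main"], target)
+relax_mod.show()
+```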
+
+## 4.5 PR list
+
+We plan to split the upstreaming into the following manageable PRs for TVM community review:
+
+- Relax IR
+- Relax VM
+- BlockBuilder
+- ExprFunctor/ExprVisitor/ExprMutator/IRFunctor
+- Relay → Relax translator
+- Minimum build (4 passes)
+- VM Codegen
+- E2E model compilation + execution
+
+# 5. **Future work**
+
+This RFC only focuses on the foundational part of Relax. After it lands, we will incrementally incorporate additional capabilities and features. Relax aims to achieve parity with the functionality provided by Relay: workloads that are functional on Relay will also be functional on Relax, even though the infrastructure underneath may change.
+
+Future plans that we will bring in future RFCs:
+
+- AOT: AOT compilation has a wide range of benefits, such as being more space efficient, and is necessary for resource-constrained projects like uTVM. We are committed to continuously supporting AOT compilation in Relax, and there is an ongoing effort to connect Relax to the current AOT executor.
+- BYOC: We will try to reuse the existing translation spec. In Relax, BYOC can be formalized as a pass that rewrites subgraphs into calls to external packed functions.

Review Comment:
   Given that this RFC proposes an alternative to a fundamental part of TVM, it would be good to have some idea of how BYOC and AoT will be supported, as well as some idea of the timelines.
   
   I understand that at the present RFC, Relax intends to be an optional compilation flow, but I think we can agree that we don't intend to maintain two overlapping IRs long term, as it would become too costly in terms of infrastructure.
   
   Based on that, can you clarify specifically whether there is an expected timeline for AoT and BYOC to be brought into Relax?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] comaniac commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
comaniac commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1221091408

   Thanks for the RFC. Although I wasn't involved in the actual Relax development, I've been attending the weekly open design review meeting for a while, and I'm glad that I could share our experience to help improve the Relax design. Thus, I don't have specific questions about the design.
   
   Regarding the point mentioned above about whether we really need a brand new IR to replace Relay: in fact, we at AWS already attempted to build a compiler-based training framework, [RAF](https://github.com/awslabs/raf), by extending Relay. Based on our experience working on RAF for the past 1.5-2 years, we agree with the Relax group that Relay does have some limitations in its design, and these limitations prevent us from easily adding new features such as dynamic shape, flexible fusion, etc. Note that I'm not saying it's impossible to implement these features by extending Relay. My point is that even if it's possible, it is hard to keep a clean system/infra without a large-scale refactoring. To me, it is even safer to build a new infra from scratch, so that existing workloads and use cases won't be affected at all. This is also the reason why we noticed Relax and tried our best to be involved in the early-stage design in the first place.
   
   Meanwhile, in terms of maintaining two IRs, I don't really think this would add much overhead to the community, because these two IRs are basically independent and can be developed separately. In other words, it's up to the developers whether to work on Relay or Relax. As long as there is still a number of Relay developers in the community, we shouldn't deprecate it. On the other hand, if people find that Relax is better over time, then developers will gradually move to Relax. Only then will we bring Relay deprecation to the table, just like nnvm in the past.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tkonolige commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
tkonolige commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949633350


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 

Review Comment:
   This document contains no section about typing or type inference. How does it work? Would it be worthwhile to add a section about it?



##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors are computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take benefit of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed rank, dynamic symbolic shape relation covers most of the use cases. Inevitably we also need to be able to cover general cases that may not fall into the category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries(e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, they are directly translated to calls that allocate and return the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc calls that takes in an NDArray x and perform an unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for exmaple the above `R.unique(x)` can be lowered to `call_packed(”torch.unique”, x)`.
+
+These features are supported by Relax VM as PackedFunc calls that return TVM Object. We can bring the tensors from no shape computation land to the shape-aware land using match_shape. The no shape computation is by no means the most effective way to handle things. It is necessary for cases like data-dependent calculation and interfaces with external libs that have weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free ****if: it only reads from its inputs and returns the result via its output, it will not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been `R.output`.
+
+In the above relax function, `R.read_tensor_from_file`, and `R.call_packed` all have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis. 
+
+We expect most of the optimizations are graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level too. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build a **end-to-end MVP** (Minimum Viable Product) which allows the users to construct an end-to-end model (represented by IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef

Review Comment:
   Above `shape_` is defined as an (implicitly optional) `Expr`, but here it is defined as `ObjectRef`. Is this because you want it to be implicitly optional? If so, I'd recommend making it explicitly optional instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950360721


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.

Review Comment:
   Let me start with my comments for the previous comment :) 
   > Relax would provide more unified and organized support for BYOC. In Relay, although the functionalities are provided, IMHO, it has been quite fragmented so it has been tricky to use multiple BYOCs together with TVM's pipeline. So, in relax, our goal is to
   > (1) Support the existing functionalities
   > (2) Organize and simplify BYOC workflow
   
   Personally, my ultimate goal is to unlock Relax to allow easy customization of both the TVM internal and BYOC pipelines. For example, we may want to rewrite the graph IRs while considering low-level optimizations for TIR and multiple BYOC choices (e.g., for `Matmul+add`, we have a bunch of choices: tune with MetaSchedule, or use a BYOC like TensorRT, Cutlass, Cublas...). And some recent BYOCs, such as Cutlass, are tunable. To enable these functionalities, the pipeline should be easy to customize, straightforward, and maintainable for future extension.
   
   This has been difficult to achieve in Relay for many reasons: for example, inflexible lowering, each BYOC has its own greedy partitioning without considering other potential BYOCs, no interaction between BYOC and main pipeline, etc. 
   
   In Relax, although it is still work-in-progress, we've seen some great signals. I would be happy to discuss further if you are interested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tkonolige commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
tkonolige commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949627983


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and dtype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor: it can instead create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+Many low-level library functions (for example in TensorRT) are designed in destination-passing style. To call into such a DPS packed function, so that the compiler can directly manage the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, we ideally want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case) and heavier machinery is required to handle them.
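+
+To spell this out with the constructs defined in Section 4.1 (an illustrative sketch, not a literal API transcript), the information the compiler tracks for `lv0` splits as follows:
+
+```python
+lv0_type  = DynTensorType(ndim=2, dtype="float32")  # checked_type_: rank and dtype only
+lv0_shape = ShapeExpr([n, 4])                        # shape_: the shape, tracked as a value
+# the type stays stable even if the shape is later refined, e.g. by match_shape
+```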
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from this shape-unaware land back into the shape-aware land using match_shape. Operating without shape computation is by no means the most efficient way to handle things, but it is necessary for data-dependent calculations and for interfacing with external libraries that provide weaker shape information.
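+
+A sketch of how these pieces fit together (illustrative lowering, following the `torch.unique` example above):
+
+```python
+# y's shape is unknown at compile time (RuntimeDepShape), so the call is not in DPS form
+y = R.call_packed("torch.unique", x)
+
+# re-enter the shape-aware world: m is a symbolic variable populated at runtime
+y1: R.Tensor[(m,), "float32"] = R.match_shape(y, (m,))
+gv: R.Tensor[(m,), "float32"] = R.exp(y1)
+```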
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output; it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+

Review Comment:
   I think relay is pure except for `Ref`s. And in practice they are not used because they are poorly supported by the compiler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950306900


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+ex = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(ex.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(ex, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: ****Unified abstractions and optimizations across layers****
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### ****call_tir****
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in `tir_primfunc` (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, and the output tensor's shape and dtype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor: it can instead create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### ****call_packed****
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) serves only as syntactic sugar for the above AST node.
+
+### ****call_dps_packed****
+
+Many low-level library functions (for example in TensorRT) are designed in destination-passing style. To call into such a DPS packed function, so that the compiler can directly manage the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: ****Shape deduction as first-class computation****
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, we ideally want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr`(`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType`(meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference, so that we do not need to enter [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where types depend on values (shapes in our case) and heavier machinery is required to handle them.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation which we cannot deduce the return tensor’s shape at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from this shape-unaware land back into the shape-aware land using match_shape. Operating without shape computation is by no means the most efficient way to handle things, but it is necessary for data-dependent calculations and for interfacing with external libraries that provide weaker shape information.
+
+## D2: ****Dataflow block as a first-class construct****
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output; it does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output()`.
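+
+For instance (an illustrative, intentionally invalid fragment based on the example above), referencing a non-output dataflow variable outside its block is rejected:
+
+```python
+with R.dataflow():
+    lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+    gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+    R.output(gv0)
+
+y = R.exp(gv0)   # OK: gv0 was marked as an output and escapes the block
+z = R.exp(lv0)   # error: lv0 is a DataflowVar, visible only inside the block above
+```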
+
+In the above relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could also be converted to work at the dataflow block level. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. In the meantime, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` in Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each binding block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with 2 `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **Instruction**(packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at indices 2 and 3 of the shape heap (see the sketch after this list). Please refer to the shape lowering pass in the next subsection for more details.
+
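+A schematic sketch of the shape-heap idea described above (the helpers here are illustrative stand-ins for the VM builtins and the generated TIR shape function, emulated with NumPy):
+
+```python
+import numpy as np
+
+# the shape heap is an integer NDArray managed by the VM
+shape_heap = np.empty(4, dtype="int64")
+
+# store_shape(A.shape): heap[0] = m, heap[1] = n  (A's shape is only known at runtime)
+shape_heap[0], shape_heap[1] = A.shape
+
+# generated shape function: heap[2] = heap[0] + 1, heap[3] = heap[1] + 1
+def shape_func(heap):
+    heap[2] = heap[0] + 1
+    heap[3] = heap[1] + 1
+shape_func(shape_heap)
+
+# (j, k) read back from indices 2 and 3
+j, k = int(shape_heap[2]), int(shape_heap[3])
+```
+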
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).

Review Comment:
   Thanks for the catch. It will be supported and [relax repo](https://github.com/tlc-pack/relax) has demonstrated the functionality of JSON runtime w/ TensorRT. Maybe good to clarify this in the RFC. cc. @YuchenJin 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950339476


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
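+
+As a library-agnostic illustration of this convention (a toy NumPy sketch for exposition only, not part of Relax):
+
+```python
+import numpy as np
+
+def add_dps(a, b, out):
+    # DPS: the caller allocates `out`; the callee only writes into it and returns nothing
+    np.add(a, b, out=out)
+
+out = np.empty((2, 2), dtype="float32")  # allocation handled by the caller/framework
+add_dps(np.ones((2, 2), dtype="float32"), np.ones((2, 2), dtype="float32"), out)
+```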
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### **call_dps_packed**
+
+Many low-level library functions (e.g., in TensorRT) are designed in DPS. To be able to call into such a DPS packed function, so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
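+
+The runtime semantics of `call_dps_packed` mirror those of `call_tir`; the code below is an illustrative sketch (demonstration pseudocode in the same spirit as the `call_tir` example above, not compiler code):
+
+```python
+def call_dps_packed(extern_func, inputs, output_shape, output_dtype):
+    """Example code to demonstrate the semantics of call_dps_packed"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)  # output allocated by the compiler/runtime
+    extern_func(*inputs, out_tensor)                        # DPS packed function mutates out_tensor
+    return out_tensor
+```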
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.

Review Comment:
   Thanks for the comment! Yes, it is supported in Relay, but there are some nontrivial limitations, imo.
   
   - (1) The Relay main pipeline lowers all Relay IR into TIR at once at the IR boundary. This makes partial lowering (lowering only part of the graph) difficult in the main pipeline.
   - (2) The Relay main pipeline supports lowering with `OpStrategy`. However, it is not necessarily easy to customize it (custom lowering).
   
   For these reasons, people introduced `RelayToTIR` and `RelayToRuntime`, which essentially bypass the main pipeline. Although this enables the functionality people want, it is hard to maintain as a framework, and it is not easy to leverage multiple lowering strategies in an incremental way. Therefore, Relax wants to tackle this problem and provide such support in an organized, systematic way. For example, since Relax provides a unified abstraction, we can introduce graph-IR-to-TIR transformations into the pipeline, and this is essentially what lowering does. Thus, by introducing such a mechanism as a Relax->TIR transformation pass, Relax can bring those functionalities into the main pipeline in a customizable manner. We expect users will be able to reuse most of the lowering machinery, since most of the time you only want to change the "how to lower" part.
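   
   To make "partial lowering" concrete, the idea is that after a custom lowering pass an IRModule can legally contain a mix like the sketch below (illustrative TVMScript only; `tir_matmul` is assumed to be a PrimFunc in the same IRModule, and `R.relu` stands in for any not-yet-lowered high-level op):
   
   ```python
   @R.function
   def mixed(x: R.Tensor[(n, 4), "float32"], w: R.Tensor[(4, 4), "float32"]):
       with R.dataflow():
           # already lowered to a TIR PrimFunc by one strategy
           lv0 = R.call_tir(tir_matmul, (x, w), (n, 4), dtype="float32")
           # still a high-level op, to be lowered later by another strategy (e.g. a library/BYOC path)
           gv0 = R.relu(lv0)
           R.outputs(gv0)
       return gv0
   ```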





[GitHub] [tvm-rfcs] zhiics commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
zhiics commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1272328462

   Based on my experience at several organizations, dynamic shape support is obviously very important, particularly given the popularity of large language models. Also, efficiently supporting dynamic shape would be one of the major appealing features of a "modern" DLC. I think the comments above have also reached agreement on the importance of dynamic shape; the major argument is whether we need separate PRs to land this feature.
   
   IMHO, Relax is already one of the components of Unity, and the current proposal only contains the most valuable part of Relax, which provides a minimal E2E compilation flow to enable support for dynamic models. This approach has worked well before, both in TVM and in other open source projects, since the component doesn't block or break current uses/deployments. For example, the first version of Relay also had the IR, simple lowering, and the necessary passes to quickly unblock users/developers (e.g. AWS) who wanted to give it a try. Afterwards, we iterated on it many times to improve both the design and the implementation.
   
   As a contributor to TVM, I would encourage us to focus more on the design itself and spot the design flaws and missing key features that we should address, so that users (some of whom are already waiting for Relax, as mentioned here) can quickly check it out and bring back more insightful feedback or directly contribute to the project.




[GitHub] [tvm-rfcs] masahi commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
masahi commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952462198


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator (instead of a standalone IR node) to minimize the impact on the IR. From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### **call_dps_packed**
+
+Many low-level library functions (e.g., in TensorRT) are designed in DPS. To be able to call into such a DPS packed function, so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While the fixed-rank, dynamic symbolic shape relation covers most of the use cases, inevitably we also need to be able to cover general cases that may not fall into this category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`, for example the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring tensors from the no-shape-computation land to the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
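+
+As a rough illustration of such a lowering target, a packed function like the one below could back `R.unique` (the registered name "relax.run_unique" and the NumPy-based body are illustrative only, not an existing builtin):
+
+```python
+import numpy as np
+import tvm
+
+@tvm.register_func("relax.run_unique")
+def run_unique(x):
+    # x is a tvm.nd.NDArray; the result tensor is allocated here, inside the callee,
+    # because its shape is only known after the computation (data-dependent shape).
+    return tvm.nd.array(np.unique(x.numpy()))
+```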
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most of the optimizations to be graph rewriting, which happens inside dataflow blocks, and most existing optimization passes in TVM could be converted to the dataflow block level as well. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) that allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and run the execution.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via Relay-to-Relax IR translator, and then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack such as TIR, TOPI, TVM FFI are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP as shown on the right side in the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes shape of a tensor cannot be deduced statically either
+    because the shape is truly data dependent such as output of
+    `unique` operator or cannot be deduced due to limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each `BindingBlock` in its `blocks` is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the SeqExpr.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`; the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` with one `VarBinding`.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and flexibility in supporting dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** instruction (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** instruction to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored in index 0 and index 1 of a shape heap (which is a TVM NDArray) via calling the vm builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details (a minimal sketch of such a generated PrimFunc is shown below).
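+
+A minimal sketch of what the generated shape-computing PrimFunc for that example could look like (the function name and exact signature are illustrative only):
+
+```python
+@T.prim_func
+def shape_func(H: T.Buffer[(4,), "int64"]) -> None:
+    # H is the shape heap: H[0] = m and H[1] = n were stored via the store_shape builtin
+    H[2] = H[0] + 1  # j = m + 1
+    H[3] = H[1] + 1  # k = n + 1
+```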
+
+As a future plan, we will consolidate the Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to it. This design decouples the optimization passes from the minimum build, which enables flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore new design spaces.
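+
+For example, a customized pipeline can be assembled from `IRModule` → `IRModule` passes and then handed to the minimum build (the pass below is a placeholder to illustrate composition; it is not an actual Relax pass):
+
+```python
+import tvm
+
+@tvm.transform.module_pass(opt_level=0)
+def my_custom_pass(mod, ctx):
+    # a real pass would rewrite the IRModule here; returning it unchanged keeps the sketch simple
+    return mod
+
+pipeline = tvm.transform.Sequential([my_custom_pass])
+# mod = pipeline(MyIRModule)                             # IRModule -> IRModule
+# exec = relax.vm.build(mod, tvm.target.Target("llvm"))  # minimum build to a VM executable
+```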
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is a `IRModule` → `IRModule` transformation.
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(x: T.handle, y: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(x, (n1, m1))
+        Y = T.match_buffer(y, (n1, m1))
+        with T.block(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### Address C0: lower `call_tir` to explicit memory allocation form
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocate and kill storage and tensors
+- Has side effect
+- No shape annotation
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, this maps to the `Call` instruction that runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### Address C1: do shape lowering via VM heap manipulation
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation

Review Comment:
   I suggest using a different term than "heap", since it is used very differently from the conventional CS sense ("heap" as in stack/heap).





[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952934825


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.

Review Comment:
   `call_tir` allows a `PrimFunc` to be called from a Relax program. It allocates the destination tensor first and invokes the `PrimFunc` in destination-passing style. It's implemented as an operator in Relax, so it's a specific kind of call. Normal `Call` nodes do not perform that allocation behavior and do not assume a destination-passing style convention (`call_tir` is really just syntactic sugar for performing the allocation and making the call).
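   
   A compact before/after sketch of that desugaring, in the spirit of the lowering example in section 4.3 of the RFC (names and shapes here are illustrative):
   
   ```python
   # sugared form:
   #     gv0 = R.call_tir(tir_exp_func, (lv2,), (n,), dtype="float32")
   # desugared form (allocation made explicit, then a destination-passing call):
   storage0 = relax.runtime.builtin.alloc_storage(size=[n], device=cpu)
   gv0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n], offset=0, dtype="float32")
   R.call_packed("tir_exp_func", lv2, gv0)
   ```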





[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949758210


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:

Review Comment:
   Of course, I'll list some relevant discussion posts that I know of:
   
   **Needs for unified abstractions:**
   
   - Please see the comments left by community members on this thread: [https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344)
   - [https://discuss.tvm.apache.org/t/new-relay-operator-based-on-tensorir/11999](https://discuss.tvm.apache.org/t/new-relay-operator-based-on-tensorir/11999)
   
   **Needs for dynamic shapes support:**
   - [https://discuss.tvm.apache.org/t/dynamic-input-output-shapes-batch-size/12689](https://discuss.tvm.apache.org/t/dynamic-input-output-shapes-batch-size/12689)
   - [https://discuss.tvm.apache.org/t/does-tvm-support-dynamic-input-shape/11069](https://discuss.tvm.apache.org/t/does-tvm-support-dynamic-input-shape/11069)
   - [https://discuss.tvm.apache.org/t/cannot-allocate-memory-symbolic-tensor-shape/12514](https://discuss.tvm.apache.org/t/cannot-allocate-memory-symbolic-tensor-shape/12514)
   
   
   **Needs for supporting dataflow block and handling side effect:**
   - https://discuss.tvm.apache.org/t/basic-block-normal-form/5908
   - https://discuss.tvm.apache.org/t/rfc-handling-effect-in-tvm-and-relay/5946
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] zhiics commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
zhiics commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r951022122


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)

Review Comment:
   Got it. Thanks for clarification.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950882628


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1

Review Comment:
   Different from Relay, shape in Relax is not part of the tensor type. So the type does not depend on the shape value, which allows more flexibility in handling dynamic shapes.
   
   In this particular case, `lv3` is bound to a `ShapeExpr` (shape expression), so it’s inferred to be of `ShapeType` (a type introduced by Relax) during type inference. It can then be used later in the program wherever a `ShapeType` value is expected, for example in `match_shape` and as `call_tir`’s output_shape.
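
   A tiny Python model of that separation (illustrative only; it follows the `DynTensorType`/`ShapeType` descriptions in this RFC rather than the actual TVM classes):

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class DynTensorType:   # a tensor type carries only rank and dtype
       ndim: int
       dtype: str

   @dataclass(frozen=True)
   class ShapeType:       # the type of shape values such as (n * m,)
       pass

   # Tensors with different (even symbolic) shapes share the same type, so type
   # inference never has to reason about shape values; a binding to a bare shape
   # expression, like lv3 above, is simply assigned ShapeType.
   assert DynTensorType(1, "float32") == DynTensorType(1, "float32")
   ```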
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r949611408


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.

Review Comment:
   Representing shapes is part of the language, so it is not independent of IR. Relay handles shapes in its type relation system and it would be very difficult to support symbolic shapes in Relay's system, since those are supposed to be checked dynamically. (Right now, Relay has an `Any` shape for dimensions that should be checked dynamically, but that's very all-or-nothing; you can't specify finer-grained shape constraints that way.)
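
   A small sketch of the contrast (the Relay line uses the existing `relay.Any()` API; the Relax annotation is the TVMScript syntax proposed in this RFC, shown as a comment since Relax is not upstream yet):

   ```python
   from tvm import relay

   # Relay: a dynamic dimension is just "Any"; two Any dimensions cannot be
   # related to each other, so finer-grained shape constraints are lost.
   x = relay.var("x", shape=(relay.Any(), 4), dtype="float32")

   # Relax (per this RFC): n is a symbolic variable, so the same n can be reused
   # to express relations between shapes.
   #
   #   def f(x: R.Tensor[(n, 4), "float32"]):
   #       lv0: R.Tensor[(n * 4,), "float32"] = R.flatten(x)
   ```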



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] Hzfengsy commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
Hzfengsy commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950155616


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+

Review Comment:
   As @slyubomirsky said, it's a decision whether we should make a new IR or add features to Relay.
   
   If we want to add this to Relay, we would need to:
   1. add new AST nodes
   2. rewrite the build pipeline (i.e. we would first need to include TIR functions in the Module for further analysis)
   3. rewrite most of the passes (i.e. we would need to do optimization on a Module that has both TIR and Relay functions)
   
   Based on that, making a new IR may be more reasonable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950509731


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface to transcend the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function bellow.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to directly interact with and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention, in which both inputs and outputs are passed to the function as arguments and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype.  Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. 
+
+### ****call_dps_packed****
+
+To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1,), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify the type inference so that we do not need to get into the [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) land where type depends on value (shape in our case) which requires heavier machinery to handle. 
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics:
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank, dynamic symbolic shape relations cover most of the use cases, we inevitably also need to cover general cases that may not fall into that category:
+
+- C0: Dynamic shape relations where the output shape is data-dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases, such as opaque runtime objects for low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray x and performs a unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return a TVM Object. We can bring tensors from the no-shape-computation land back into the shape-aware land using match_shape. The no-shape-computation path is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libs that have weaker shape information.
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been marked as an output via `R.output`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything that is outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could also be converted to the dataflow block level. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations for the places that need them.
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) which allows users to construct an end-to-end model (represented as an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module, and run it on a runtime. Other components in the TVM stack, such as TIR, TOPI, and TVM FFI, are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, as shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.

Review Comment:
   Relax is not a superset of Relay _per se_, though it happens to reuse some of the AST implementation. In particular, the type system and encoding of shapes are entirely different.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] masahi commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
masahi commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952462198


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example, TensorRT) so that higher-level frameworks (for example, the compiler) can handle memory allocation.
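+
+As a concrete (non-TVM) illustration of this calling convention, the NumPy sketch below shows a DPS-style kernel: the caller owns the output allocation and the callee only writes into it.
+
+```python
+import numpy as np
+
+# DPS-style kernel: writes the result into the pre-allocated `out`
+# argument instead of returning a freshly allocated array.
+def add_dps(a, b, out):
+    np.add(a, b, out=out)
+
+a = np.ones((2, 3), dtype="float32")
+b = np.ones((2, 3), dtype="float32")
+out = np.empty((2, 3), dtype="float32")  # allocation is handled by the caller
+add_dps(a, b, out)
+```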
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.
+- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations.
+- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR.
+- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation.
+- Provide a flexible way to incorporate TensorIR and existing libraries such as cuDNN.
+
+Through this unified interface, ML researchers, system engineers, and hardware vendors can collaborate better, since we can incrementally optimize and translate specific parts of the whole program in Relax.
+
+## D1: **Shape deduction as first-class computation**
+
+Shape deduction is essential to compiling dynamic workloads. Under a dynamic shape setting, the destination-passing call style adopted by call_tir and call_dps_packed requires that the shapes of the output tensors be computed. We can solve this challenge by invoking a function to compute the shape before calling the operator function. However, there are also cases where the shape itself is data-dependent (e.g. the `unique` operation used to select the unique elements of a tensor). Finally, since most dynamic shape workloads still contain a lot of (partially) static shapes, ideally we want to take advantage of this static shape information for optimization.
+
+In Relax, a shape constraint of a tensor is represented by two fields of the `relax.Expr` (`RelayExpr`).
+
+- `checked_type_: Type`, stores the generic rank and dtype constraints.
+- `shape_: Expr`, stores ways to compute the shape of the expression at runtime. It’s `nullptr` when the expression’s `checked_type_` is not `DynTensorType` (meaning the expression is not a Tensor). Otherwise, this `shape_` field takes one of the 3 possible types outlined below.
+
+**checked_type_**
+
+`Expr→checked_type_` stores the compile time deduced type of an expression. We introduce a new type `DynTensorType` to represent the type of a Relax tensor Expr, which contains the following two fields:
+
+```python
+class DynTensorType(Type): 
+    ndim: int # ndim=-1 means unknown rank
+    dtype: DataType # dtype=DataType::Void() means unknown dtype
+```
+
+**shape_**
+
+`DynTensorType` does not contain shape information. Instead, the shape of a Tensor is stored in an **optional** `shape_` field in an Expr.
+
+For an `Expr x`, `x.shape_` can contain the following values:
+
+- V0: `ShapeExpr` (see Section 4.1 for its definition), which contains an `Array<PrimExpr>`. Static shapes are always represented in this form by encoding each dimension as `IntImm`. Symbolic shapes can also be represented (see section 4.1 for more).
+- V1: Generic `Expr`, which is expected to, at runtime, result in something of type `Shape`. The `Expr` can call into opaque (shape) functions, or shape deduction intrinsics.
+- V2: `RuntimeDepShape` (see Section 4.1 for its definition), a special `Expr` to indicate that shape is unknown at compile time and cannot be determined at runtime without producing the attached Tensor (see Safety Net section for its handling).
+
+The following program covers typical scenarios in shape deduction (marked in comments). Importantly, shape is now part of the computation along with Tensor values. This reflects the fact that the computation of shapes can happen at runtime.
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def shape_example(x: R.Tensor[(n, 2, 2), "float32"]):
+    with R.dataflow():
+        # V0: symbolic and static shape deduction
+        lv0: R.Tensor[(n, 4), "float32"] = R.reshape(x, (n, 4))
+        lv1: R.Tensor[(n * 4,), "float32"] = R.flatten(lv0)
+        lv2: R.Shape = (n * 4,)
+
+        # V1: external opaque shape function
+        lv3: R.Shape = R.call_packed("myshape_func", lv2)
+        lv4 = R.call_tir("custom_func", (lv1,), lv3, dtype="float32")
+
+        # V2: runtime dependent case: _ is used to represent RuntimeDepShape
+        lv5: R.Tensor[_, "float32"] = R.unique(lv4)
+
+        # re-match shape
+        lv6: R.Tensor[(m,), "float32"] = R.match_shape(lv5, (m,))
+        lv7: R.Shape = R.match_shape(lv3, (m,))
+
+        gv0: R.Tensor[(m,), "float32"] = R.exp(lv6)
+        R.outputs(gv0)
+
+    return gv0
+```
+
+While the text format type annotation `lv0: R.Tensor[(n, 4), "float32"]` shows the shape of each value, this is only syntactic sugar. From the IR’s point of view, the `shape_` field `(n, 4)` is not included in the type signature of `lv0`. The type signature of `lv0` is `DynTensor(rank=2, dtype="float32")`, and the shape is a special value field that is attached to each `Expr`. We made this explicit choice to simplify type inference so that we do not need to enter [dependent typing](https://en.wikipedia.org/wiki/Dependent_type) territory, where a type depends on a value (the shape, in our case), which requires heavier machinery to handle.
+
+**match_shape**
+
+After a data-dependent computation (like `unique`) or external calls, we may need to be able to recover/refine the shape information to enable more optimizations. The `match_shape` construct is used to perform such refinements.
+
+`var: Var = match_shape(value: Expr, pattern: List[PrimExpr])`
+
+The match_shape construct takes a **value** and a **pattern** (a list of `PrimExpr`, for example `(m, n)`), and returns a **var**. It has two overloaded semantics (a plain-Python sketch of the matching logic is given after the list):
+
+- When value is a Tensor, it matches `value.shape` to the pattern, populates the corresponding symbolic integer variable if it occurs in the pattern for the first time in the scope, and then returns a new Tensor that is the same as value but the shape field is updated to the pattern. In the V2 case in the above code snippet, `R.match_shape(lv5, (m,))` defines a symbolic TIR variable `m`, and matches tensor lv5’s shape with the pattern `(m,)`.
+- When value is a Shape (for example `lv7: R.Shape = R.match_shape(lv3, (m,))` in the above code snippet), it directly matches the pattern, and returns a Shape. This is useful when we want to isolate out shape functions that do not correspond to any Tensor value.
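+
+The following is a minimal plain-Python sketch of the matching logic described above (illustrative only, not the Relax implementation): a symbolic variable is bound on its first occurrence in the scope and checked for consistency afterwards.
+
+```python
+def match_shape(shape, pattern, bindings):
+    """shape: concrete dims; pattern: ints or symbolic names; bindings: scope dict."""
+    assert len(shape) == len(pattern), "rank mismatch"
+    for dim, sym in zip(shape, pattern):
+        if isinstance(sym, str):          # symbolic dimension such as "m"
+            if sym not in bindings:
+                bindings[sym] = dim       # first occurrence: define it
+            elif bindings[sym] != dim:
+                raise ValueError(f"inconsistent binding for {sym}")
+        elif sym != dim:                  # static dimension must match exactly
+            raise ValueError("static dimension mismatch")
+    return bindings
+
+env = {}
+match_shape((5,), ["m"], env)           # defines m = 5
+match_shape((5, 10), ["m", "n"], env)   # reuses m, defines n = 10
+print(env)                              # {'m': 5, 'n': 10}
+```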
+
+**Safety Net (handle `RuntimeDepShape`)**
+
+While fixed-rank tensors with dynamic symbolic shape relations cover most of the use cases, inevitably we also need to be able to cover general cases that do not fall into that category:
+
+- C0: Dynamic shape relations where output shape is data dependent on the input (e.g. `unique` operator).
+- C1: Rank of a tensor is not known (can happen in rare cases of loops).
+- C2: dtype of a tensor is not known.
+- C3: Other cases: opaque runtime objects used by low-level libraries (e.g. PRNG handle, cuDNN context).
+
+As a result, it is important to have a "safety net" solution so that we cover the general cases.
+
+Suppose we have a `unique` operation whose return tensor’s shape we cannot deduce at compile time:
+
+`y: R.Tensor[_, _] = R.unique(x)`
+
+During lowering, this call won't get translated into destination-passing style, because it is impossible to obtain the shape information and pre-allocate the memory. Instead, it is directly translated to a call that allocates and returns the result tensor.
+
+- `R.unique` can be mapped to a runtime PackedFunc call that takes in an NDArray `x` and performs the unique operation.
+    - We can even dispatch to common runtime libraries such as `torch.unique`; for example, the above `R.unique(x)` can be lowered to `call_packed("torch.unique", x)`.
+
+These features are supported by the Relax VM as PackedFunc calls that return TVM Objects. We can bring such tensors back into the shape-aware world using `match_shape`. Skipping shape computation is by no means the most effective way to handle things, but it is necessary for cases like data-dependent calculations and interfaces with external libraries that have weaker shape information.
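+
+For illustration, one possible lowering target for `R.unique` is simply a registered PackedFunc that allocates its own result. The sketch below uses NumPy, and the registered name is a hypothetical placeholder rather than an existing TVM symbol.
+
+```python
+import numpy as np
+import tvm
+
+# Hypothetical runtime hook for the data-dependent `unique` case: the
+# output tensor is allocated inside the function because its shape is
+# only known after the computation runs.
+@tvm.register_func("relax.run_unique")
+def run_unique(x):
+    return tvm.nd.array(np.unique(x.numpy()))
+```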
+
+## D2: **Dataflow block as a first-class construct**
+
+Most machine learning models can be represented with a **pure**/**side-effect-free** computational graph. An operation is pure or side-effect-free if it only reads from its inputs and returns the result via its output, and does not change other parts of the program (such as incrementing a global counter).
+
+A **dataflow graph** means every operation inside is **side-effect free** and there are no **control flows** (such as if-then-else). A **dataflow block** is a way for us to mark the dataflow graph regions of the program in Relax. Specifically, all the operations under the dataflow block are side-effect-free and do not contain control flows (control flow is an advanced semantic that most pass writers do not consider). Outside a dataflow block, operations can contain side effects (for example doing in-place weight update during model training) and control flow. The program below is an example program that contains two dataflow blocks.
+
+```python
+@R.function
+def main(x: R.Tensor((1, 784), "float32"), 
+         w: R.Tensor((128, 784), "float32"), 
+         b: R.Tensor((128,), "float32")):
+
+    with R.dataflow():
+        # linear and relu are PrimFuncs in the same IRModule
+        lv0 = R.call_tir(linear, (x, w, b), (1, 128), dtype="float32")
+        gv0 = R.call_tir(relu, (lv0,), (1, 128), dtype="float32")
+        R.output(gv0)
+
+    R.call_packed("custom_inplace_update", gv0)
+    gv1 = R.read_tensor_from_file("tensor.txt")
+
+    with R.dataflow():
+        out = R.call_tir(linear1, (gv0, gv1, b), (1, 128), dtype="float32")
+        R.output(out)
+    return out
+```
+
+A dataflow block can effectively be **viewed as a computational graph** embedded in the program.
+
+Binding variables assigned in a dataflow block are by default local to the dataflow block, and these variables can be viewed as “internal nodes” of the computational graph. When those variables are needed outside the scope of that dataflow block (output nodes in the computational graph), they must be explicitly output using `R.output()`. In the example above, `lv0` is local to its dataflow block and can’t be referenced outside the block. `gv0` can be referenced directly via its name in the surrounding scope because it has been output via `R.output()`.
+
+In the above Relax function, `R.read_tensor_from_file` and `R.call_packed` both have side effects, so they reside outside of the dataflow block. Anything outside of a dataflow block may have side effects, so we cannot perform optimizations such as reordering these bindings according to topological order unless we do more careful analysis.
+
+We expect most optimizations to be graph rewrites, which happen inside dataflow blocks, and most existing optimization passes in TVM could also be converted to work at the dataflow block level. These optimizations can be done by ML engineers who are familiar with the computational graph concept. The ability to isolate and represent effectful components also provides opportunities for more advanced optimizations in the places that need them.
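+
+As a small illustration of why purity matters for such rewrites, the self-contained sketch below (plain Python, not the TVM pass infrastructure) performs dead-code elimination over a dataflow block modeled as a list of bindings; dropping unused bindings is only safe because every binding in the block is side-effect-free.
+
+```python
+def dead_code_eliminate(bindings, outputs):
+    """bindings: list of (var, vars_it_uses); outputs: vars escaping the block."""
+    live = set(outputs)
+    kept = []
+    for var, uses in reversed(bindings):
+        if var in live:               # the binding contributes to an output
+            kept.append((var, uses))
+            live.update(uses)
+    return list(reversed(kept))
+
+block = [("lv0", ["x", "w"]), ("lv1", ["lv0"]), ("gv0", ["lv0"])]
+print(dead_code_eliminate(block, {"gv0"}))  # lv1 is dropped
+```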
+
+# 4. **Reference-level explanation**
+
+To achieve the design points described in the last section, this RFC focuses on how to build an **end-to-end MVP** (Minimum Viable Product) that allows users to construct an end-to-end model (represented by an IRModule), transform/build the IRModule, and execute it.
+
+As shown in the diagram below, users can construct a Relax AST either by writing TVMScript or via the Relay-to-Relax IR translator, then compile the Relax AST via the Relax minimum compilation flow to generate an executable module and run it on the runtime. Other components in the TVM stack, such as TIR, TOPI, and the TVM FFI, are **shared** between Relay and Relax. We need three major components to put together an end-to-end MVP, as shown on the right side of the diagram: **Relax AST**, **Relax runtime**, and **Relax minimum compilation flow**. This section illustrates the underlying techniques for these three components.
+
+<p align="center">
+    <img src='../resources/relax-e2e-flow.png' width='600'>
+</p>
+
+## 4.1 Relax AST
+
+To support the key design points in the last section, Relax introduces the following constructs to the AST. Meanwhile, we reuse `RelayExpr`, `Call`, `Constant`, `Tuple`, `If`, `Op`, `GlobalVar`, and `TupleGetItem` from Relay.
+
+```python
+class Expr(BaseExpr):
+    """This is RelayExpr, but we add a shape_ field."""
+    checked_type_: Type
+    shape_: ObjectRef
+
+class ShapeExpr(Expr):
+    """corresponds to a shape containing symbolic PrimExpr"""
+    values: List[PrimExpr]
+
+class RuntimeDepShape(Expr):
+    """represents a runtime-dependent shape
+    Sometimes the shape of a tensor cannot be deduced statically,
+    either because the shape is truly data-dependent (such as the
+    output of the `unique` operator) or because of limited shape
+    inference capability.
+    """
+    pass
+
+class Var(Expr):
+    """a function/SeqExpr scope visible variable that can be bound to other Expr"""
+    vid: Id
+    type_annotation: Optional[Type]
+
+class DataflowVar(Var):
+    """a specific type of Var that only has dataflow scope visibility"""
+    pass
+
+class Binding(Node):
+    """the base class of bindings"""
+    pass
+
+class VarBinding(Binding):
+    """variable bindings, bind the value to the var"""
+    var: Var
+    value: Expr
+
+class MatchShape(Binding):
+    """A type of binding which represents to matching a shape
+    Example: MatchShape(x, [m, n], var)
+    means matching Tensor x's shape to symbolic variables (m, n),
+    and returns a 2-D tensor with the same shape as tensor x (but with
+    explicit shape field [m, n]) to the output *var*;
+    """
+    value: Expr
+    pattern: List[PrimExpr]
+    var: Var
+
+class BindingBlock(Node):
+    """base class of binding block, bindings inside can be impure (with side effect or control flow)"""
+    bindings: List[Binding]
+
+class DataflowBlock(BindingBlock):
+    """dataflow block, bindings inside are pure (side-effect-free and no control flow)"""
+    pass
+
+class SeqExpr(Expr):
+    """sequence of BindingBlocks, can serve as the body of a Function"""
+    blocks: List[BindingBlock]
+    body: Expr
+
+class Function(BaseFunc):
+    """represents a Relax function"""
+    params: List[Var]
+    body: Expr   
+    ret_type: Type
+
+class ExternFunc(BaseFunc):
+    """extern function, which represents a PackedFunc, used in call_packed."""
+    global_symbol: String
+```
+
+With Relax IR, the overall structure of a Relax function is as follows:
+
+
+<p align="center">
+    <img src='../resources/relax-function-structure.svg' width='350'>
+</p>
+
+- Relax has first-class function support. A `Function`'s body can be any `Expr`, and Relax has an explicit data structure to handle binding blocks —`SeqExpr`, which usually serves as a Function’s body.
+- A `SeqExpr` contains a list (sequence) of `BindingBlock` and a `body` expression.
+- `DataflowBlock` is a special kind of `BindingBlock` that is identical to a pure computational graph. The bindings inside `DataflowBlock` have no side effects and no control flow.
+- A `BindingBlock` consists of a list of `Binding`.
+- `Binding` can be either `VarBinding` or `MatchShape`.
+- The scope of a `DataflowVar` is its `DataflowBlock`, a normal `Var` in a `DataflowBlock` escapes to the scope containing the block (which could be the function scope or some other scope like an *if* branch). Note that TIR variables (bound by `MatchShape`) have the same scoping rules as normal `Var`.
+- A `SeqExpr` is evaluated as follows: each block in its `blocks` list is evaluated in order, and then the `body` expression is evaluated; the result of evaluating the body is the result of evaluating the `SeqExpr`.
+
+Let's take the following Relax program as an example: `relax_func` contains a `SeqExpr`, and the `SeqExpr` contains a `DataflowBlock` (with two `VarBinding`s) and a `BindingBlock` (with one `VarBinding`).
+
+```python
+from tvm.script import relax as R
+
+@R.function
+def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[(k, m), "float32"]):
+    # start a DataflowBlock
+    with R.dataflow(): ## <= DataflowBlock
+        lv0: R.Tensor[(n, m), "float32"] = R.dot(x, w) ## <= VarBinding, lv0 is a DataflowVar
+        gv0: R.Tensor[(n * m,), "float32"] = R.flatten(lv0) ## <= VarBinding, gv0 is a Var that escapes to the outer scope
+        R.outputs(gv0)
+
+    # start a BindingBlock
+    gv1 = R.call_packed("custom_inplace_update", gv0) ## <= side-effect binding
+    return gv1
+```
+
+## 4.2 Relax runtime
+
+For ease of implementation and the flexibility to support dynamic workloads, we start with a flexible register-based VM runtime similar to the Relay VM, but with two distinctions:
+
+- Minimal instruction set (including Call, Ret, If, Goto):
+    - **Call** **instruction** (packed function invocation) as the core instruction, since eventually TIR is also compiled to PackedFuncs.
+    - Builtin packed function library to bridge the IR and runtime (e.g., `shape_of(tensor)` is one of the builtin packed functions to be invoked with the **Call** **instruction** to get the shape of a tensor).
+- Do shape calculations via shape heap (an internal NDArray) manipulation.
+    - Suppose Tensor A's shape is (m, n) at compile time, and in the Relax program we want to compute (j, k) = (m+1, n+1). At runtime, A's shape will be stored at index 0 and index 1 of a shape heap (which is a TVM NDArray) by calling the VM builtin function `store_shape(A.shape)`. m+1 and n+1 will be computed by a TIR PrimFunc generated in the shape lowering pass, and j and k will be stored at index 2 and 3 of the shape heap. Please refer to the shape lowering pass in the next subsection for more details; a plain-NumPy sketch of this bookkeeping follows the list.
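+
+The sketch below mimics this shape-heap bookkeeping with a plain NumPy array. The function names are illustrative stand-ins for the VM builtins and the generated shape PrimFunc, not actual TVM APIs.
+
+```python
+import numpy as np
+
+shape_heap = np.zeros(4, dtype="int64")    # one slot per symbolic dimension
+
+def store_shape(shape, slots):
+    # stand-in for the VM builtin that records a tensor's shape
+    shape_heap[list(slots)] = shape
+
+def shape_func(in_slots, out_slots):
+    # stand-in for the generated TIR shape function computing (j, k) = (m+1, n+1)
+    shape_heap[list(out_slots)] = shape_heap[list(in_slots)] + 1
+
+store_shape((8, 16), (0, 1))   # m = 8, n = 16 stored at slots 0 and 1
+shape_func((0, 1), (2, 3))     # j = 9, k = 17 stored at slots 2 and 3
+print(shape_heap)              # [ 8 16  9 17]
+```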
+
+As future plan, we will consolidate Relay VM and Relax VM, and integrate Relax with the AOT executor (see Section 5).
+
+## 4.3 Relax minimum compilation flow
+
+In Relax, we need to ensure a unified and minimum build that maps an IRModule → runtime.Module. This minimum build is capable of building any valid IRModule, no matter what transformations have been applied to the IRModule. This design decouples the optimization passes from the minimum build, which will enable flexible and customizable compilation pipelines without the need to hack into the core of the compiler, and allows users to explore the new design space.
+
+Relax compilation flow is designed with the following goals:
+
+- Compile Relax program to a format that the Relax runtime can directly execute.
+- A compilation pipeline that enables composable transformations:
+    - Every transformation is an `IRModule` → `IRModule` transformation (see the sketch after this list).
+    - Users might run part of the program with third-party libraries such as cuDNN. We need to be able to optimize the remaining parts.
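+
+A minimal sketch of the composability goal mentioned above: because every pass is an `IRModule` → `IRModule` function, a custom pipeline is just function composition. The pass names in the comment are placeholders, not actual Relax passes.
+
+```python
+def run_pipeline(mod, passes):
+    # each pass maps an IRModule to a new IRModule
+    for transform in passes:
+        mod = transform(mod)
+    return mod
+
+# e.g. run_pipeline(MyIRModule, [fuse_ops, offload_to_cudnn, lower_call_tir]),
+# where each name stands for a user-chosen IRModule -> IRModule transformation
+```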
+
+Let's take compiling the following simple Relax program as a running example.
+
+```python
+import tvm.script
+from tvm.script import tir as T, relax as R
+
+@tvm.script.ir_module
+class MyIRModule:
+    @T.prim_func
+    def tirexp(a: T.handle, b: T.handle):
+        n1, m1 = T.var("n1"), T.var("m1")
+        X = T.match_buffer(a, (n1, m1))
+        Y = T.match_buffer(b, (n1, m1))
+        with T.block(n1, m1) as (i, j):
+            Y[i, j] = T.exp(X[i, j])
+    
+    @R.function
+    def relax_function(x: R.Tensor[(n, m)]):
+        with R.dataflow():
+            lv0: R.Tensor[(n, m)] = R.call_tir(tirexp, (x,), (n, m), dtype="float32")
+            gv0: R.Tensor[(m*n,)] = R.call_tir("flatten", (lv0,), (m*n,), dtype="float32")
+            R.outputs(gv0)
+
+        return gv0
+```
+
+There are two challenges to lowering a Relax program to Relax VM instructions:
+
+- C0: Every `call_tir` needs to be lowered because Relax runtime only supports calling a packed function directly → We need to insert explicit memory allocation for each `call_tir`.
+- C1: The symbolic shape variables `n` and `m` are not something that the runtime can represent (the Relax VM only supports `NDArray` and `ShapeTuple` runtime data structures) → We need to use the heap in the runtime to do shape calculations.
+
+### **Address C0: lower `call_tir` to explicit memory allocation form**
+
+An explicit memory form program has the following properties:
+
+- Explicitly allocates and kills storage and tensors
+- Has side effects
+- Has no shape annotations
+- Core expression: `call(func_name, arg0, arg1, ...) -> optional<Expr>`, which maps to the `Call` instruction that the runtime can directly execute.
+
+We can introduce four builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_storage(size, device) -> storage`: Allocate a storage (a contiguous block of memory) that can be used to create tensors.
+- `relax.runtime.builtin.alloc_tensor(storage, shape, offset, dtype) -> tensor`: Allocate a tensor in a storage.
+- `relax.runtime.builtin.free_storage(storage)`: Free the allocated storage.
+- `relax.runtime.builtin.free_tensor(tensor)`: Free the allocated tensor.
+
+Program after call_tir lowering:
+
+```python
+@R.function
+def relax_function(x):
+    # the memory allocation has side effect, so it's now in a BindingBlock instead of a DataflowBlock
+    n, m = R.match_shape(x.shape)
+
+    storage0 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor0 = relax.runtime.builtin.alloc_tensor(storage0, shape=[n, m], offset=0, dtype="float32")
+    R.call_packed("tirexp", x, tensor0)
+
+    storage1 = relax.runtime.builtin.alloc_storage(size=[n*m], device=cpu)
+    tensor1 = relax.runtime.builtin.alloc_tensor(storage1, shape=[m*n,], offset=0, dtype="float32")
+    R.call_packed("flatten", tensor0, tensor1)
+
+    R.call_packed("free_tensor", tensor0)
+    R.call_packed("free_storage", storage0)
+    return tensor1
+```
+
+In a future RFC, we will design and implement a memory planner to be leveraged both by the Relax VM flow discussed here and the AOT flow to be defined in the future.
+
+### **Address C1: do shape lowering via VM heap manipulation**
+
+We can introduce three builtin functions in the runtime:
+
+- `relax.runtime.builtin.alloc_heap(size) -> heap`: Allocate the heap (an NDArray) with a specific size to execute shape computation

Review Comment:
   Suggest to use a different term than "heap", since it is used very differently from the conventional CS sense ("heap" as in stack / heap). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] slyubomirsky commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
slyubomirsky commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r952934825


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example, TensorRT) so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.

Review Comment:
   `call_tir` allows a `PrimFunc` to be called from a Relax program. It allocates the destination tensor first and invokes the `PrimFunc` in destination-passing style. It's implemented as an operator in Relax, so it's a specific kind of call. Normal `Call` nodes can only call Relax functions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] sunggg commented on a diff in pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
sunggg commented on code in PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r953154984


##########
rfcs/0089-relax-upstreaming.md:
##########
@@ -0,0 +1,701 @@
+- Feature Name: Relax Upstreaming
+- Start Date: 2022-08-17
+- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm)
+
+# 1. **Summary**
+
+This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers.
+
+# 2. **Motivation and goals**
+
+Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM:
+
+- Build a unified interface that transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc);
+- Enable and optimize dynamic shape workloads;
+- Support “computational graph” style optimizations with advanced dataflow semantics.
+
+For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496).
+
+The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles:
+
+- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow.
+- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code.
+
+This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community.
+
+# 3. **Guide-level explanation**
+
+This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands.
+
+(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.)
+
+## User-facing interface
+
+After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM.
+
+```python
+import tvm.script
+from tvm.script import relax as R, tir as T
+
+# Relax IRModule written in TVMScript
+@tvm.script.ir_module
+class MyIRModule:
+    # This is a TIR PrimFunc which calls the TIR intrinsic T.exp
+    @T.prim_func
+    def tir_exp_func(x: T.handle, y: T.handle): ## <= D2
+        X = T.match_buffer(x, (n,), "float32")
+        Y = T.match_buffer(y, (n,), "float32")
+        with T.grid(n) as i:
+            Y[i] = T.exp(X[i])
+
+    # This is a Relax function which contains a dataflow block
+    # representing a computational graph, as well as a call to an
+    # opaque packed function which performs an in-place update to the
+    # data in variable gv0.
+    # We mark the corresponding design points (D0, D1, D2) that map to
+    # the following sections throughout the relax function below.
+    @R.function
+    def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]):
+    # n, k above are implicitly defined within the function signature
+    # so we will be able to refer to n, k within all of relax_func
+        with R.dataflow(): ## <= D2
+            lv0 = R.match_shape(w, (k, m)) ## <= D1
+            lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0)
+            lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1
+            lv3: R.Shape = (n * m,)  ## <= D1
+            gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0
+            R.outputs(gv0)
+
+        R.call_packed("custom_inplace_update", gv0) ## <= D0, D2
+        return gv0
+
+# Print IRModule with syntax highlighting
+MyIRModule.show()
+
+# Build the Relax IRModule
+target = tvm.target.Target("llvm")
+exec = relax.vm.build(MyIRModule, target)
+
+# Dump the VM executable instructions as text
+print(exec.as_text())
+
+# Run the function on Relax VM runtime
+vm = relax.VirtualMachine(exec, tvm.cpu())
+shape = (2, 3)
+data = tvm.nd.array(np.random.rand(*shape).astype(np.float32))
+res = vm["relax_func"](data)
+```
+
+## D0: **Unified abstractions and optimizations across layers**
+
+The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI).
+
+The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention in which both inputs and outputs are passed to the function as arguments, and the outputs are mutated directly inside the function:
+
+```python
+def low_level_func(input0, input1, ..., output):
+    # implementations
+```
+
+The main idea of DPS is that inputs and outputs are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example, TensorRT) so that higher-level frameworks (for example, the compiler) can handle memory allocation.
+
+### **call_tir**
+
+In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below.
+
+```python
+def call_tir(tir_primfunc: GlobalVar, 
+             inputs: Tuple[Expr], 
+             output_shape: Shape, 
+             output_dtype: DataType) -> Expr:
+    """Example code to demonstrate the semantics of call_tir"""
+    out_tensor = alloc_tensor(output_shape, output_dtype)
+    low_level_func(*inputs, out_tensor)
+    return out_tensor
+```
+
+`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, the output tensor shape, and the output datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse.
+
+`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes:
+
+```python
+Call(
+    op=Op::Get("relax.call_tir"),   
+    tir_primfunc,
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+### **call_packed**
+
+In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. 
+
+From the AST’s point of view, we do not need to introduce an additional call node; instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value):
+
+```python
+Call(op=ExternFunc("my_packed_func"), *args)
+```
+
+`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only serves as syntactic sugar for the above AST node.
+
+### **call_dps_packed**
+
+To call into a DPS packed function (many low-level library functions, e.g. in TensorRT, are designed this way) so that the compiler can directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),   
+    ExternFunc("my_packed_func"),
+    inputs,
+    output_shape,
+    output_dtype
+)
+```
+
+Suppose `custom_packed_func` is a user-defined packed function in DPS:
+
+```python
+R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32")
+```
+
+corresponds to the following AST:
+
+```python
+Call(
+    op=Op::Get("relax.call_dps_packed"),
+    ExternFunc("custom_packed_func"),
+    (input0, input1),
+    output_shape=(3, 4), 
+    output_dtype="float32"
+)
+```
+
+The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program.
+
+```python
+from tvm.script import relax as R
+
+# User-defined packed functions
+# Non-DPS PackedFunc with return
+@tvm.register_func("custom_add")
+def add_packed(a, b):
+    ret = a.numpy() + b.numpy()
+    return tvm.nd.array(ret)
+
+# Non-DPS PackedFunc without return
+@tvm.register_func("custom_print")
+def print_packed(a):
+    print(a)
+
+# DPS PackedFunc
+@tvm.register_func("custom_tile")
+def tile_packed(a, b):
+    b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2)))
+
+@tvm.script.ir_module
+class MyIRModule:
+    # define a PrimFunc to do matrix multiply
+    # note TIR PrimFunc is in DPS, here z is the output
+    @T.prim_func
+    def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
+        m = T.var("int32")
+        n = T.var("int32")
+        k = T.var("int32")
+        A = T.match_buffer(x, (m, n))
+        B = T.match_buffer(y, (n, k))
+        C = T.match_buffer(z, (m, k))
+
+        for (i0, j0, k0) in T.grid(m, n, k):
+            with T.block():
+                i, j, k = T.axis.remap("SSR", [i0, j0, k0])
+                with T.init():
+                    C[i, j] = 0.0
+                C[i, j] += A[i, k] * B[j, k]
+
+    @R.function
+    def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]):
+        with R.dataflow():
+            # call_tir calls into a PrimFunc, and returns the matrix multiplication result
+            gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32")
+            R.outputs(gv0)
+
+        # call into a PackedFunc to print the value of gv0
+        R.call_packed("custom_print", gv0)
+
+        # call the registered "custom_add" non-DPS PackedFunc and return the result
+        gv1 = R.call_packed("custom_add", gv0, gv0)
+
+        # call the registered "custom_tile" DPS PackedFunc and return the result
+        gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32")
+        return gv2
+```
+
+This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to:
+
+- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as is done today.

Review Comment:
   This seems to be related to my reply above: https://github.com/apache/tvm-rfcs/pull/89#discussion_r953154506



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] YuchenJin commented on pull request #89: [RFC] Relax Upstreaming

Posted by GitBox <gi...@apache.org>.
YuchenJin commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1311123122

   Thanks everyone for the discussions! A brief recap of our discussions so far:
   
   - We are certain that Relax supports dynamic-shape workloads that are not supported by the current TVM, which can immediately benefit many community members and users.
   
   - For why Relax should be brought into the project today, we showed that having Relax and Relay co-exist in the codebase is a positive thing in several aspects ([https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1267729342](https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1267729342)). And the path to moving TVM to a Relax-only project will be long, so Relax and Relay co-existing is necessary for the foreseeable future, just like how TorchFX co-exists with TorchScript in the Pytorch project. We acknowledge the concern that Relax can bring confusion to some of the members in terms of which IR to contribute to, but we also encourage the community to consider the fact that Relax can directly bring dynamic-shape compilation to TVM while the original workloads can still be compiled by Relay compilation, and other factors including community empowerment and the scope of this proposed module.
   
   - It’s been pointed out that it would be helpful if we lay out the ideal scenario for how we see Relax and TVM Unity evolving over time in the TVM project. The reason we have built Relax is that we are confident that Relax both in current and future forms will significantly improve TVM, and we have outlined the future opportunities in [https://github.com/tqchen/tvm-rfcs/blob/main/rfcs/0091-establish-tvm-unity-connection.md#4-future-opportunities-after-unity-connection](https://github.com/tqchen/tvm-rfcs/blob/main/rfcs/0091-establish-tvm-unity-connection.md#4-future-opportunities-after-unity-connection). Nevertheless, it is helpful to explain in more detail given our current state of knowledge, so we will bring in the discussions of integration of Relax in TVM default flows and consolidation/deprecation of Relay and Relax as add-ons to the [roadmap](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0069-relax-roadmap.md).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [tvm-rfcs] tqchen commented on pull request #89: [RFC] Relax Upstreaming

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1495957710

   An update: thanks to the efforts of many community members, we are now at a point where the initial foundational items in the unity branch are established.
   
   One goal of G1 is to allow some time to answer questions, and to provide examples to those who have asked for more detailed evidence and a feasibility analysis of migrating some modules.
   
   As part of the effort, the community members have been posting tutorials on related topics of interest in the [forum](https://discuss.tvm.apache.org/c/development/unity/14). Please post questions and your thoughts about what additional tutorials you want to see and open up further technical discussions.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [RFC] Relax Upstreaming [tvm-rfcs]

Posted by "YuchenJin (via GitHub)" <gi...@apache.org>.
YuchenJin commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1904969432

   > It's worth noting that with the merging of Unity into TVM's main branch, Relax has already been _de facto_ upstreamed.
   
   🥳 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [RFC] Relax Upstreaming [tvm-rfcs]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen commented on PR #89:
URL: https://github.com/apache/tvm-rfcs/pull/89#issuecomment-1904964309

   indeed, check out https://github.com/apache/tvm/issues/16446


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [RFC] Relax Upstreaming [tvm-rfcs]

Posted by "tqchen (via GitHub)" <gi...@apache.org>.
tqchen closed pull request #89: [RFC] Relax Upstreaming
URL: https://github.com/apache/tvm-rfcs/pull/89


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org