Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/07/20 22:10:03 UTC

[GitHub] [incubator-tvm] jroesch commented on a change in pull request #6097: [DOCS][REFACTOR] Organize Design and Architectures

jroesch commented on a change in pull request #6097:
URL: https://github.com/apache/incubator-tvm/pull/6097#discussion_r457711532



##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.

Review comment:
       ```suggestion
   architecture of TVM and/or actively develop on the project.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:

Review comment:
       ```suggestion
   This page is organized as follows:
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the

Review comment:
       ```suggestion
   This document is intended for developers who want to understand the
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.

Review comment:
       I'm not sure what we should use here, but we should come up with a better title imo cc @hodgepodge

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections that follow are specific guides to each logical component,
+  organized by component name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example end-to-end compilation flow and discuss the key data structures and transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ for useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent (e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translates (codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: The user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and the APIs that transform them. Once we have identified the key data structures, we can then decouple the system into logical components that either define a collection of key data structures or define transformations among them.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end-to-end model. If you are familiar with the computational graph terminology used in deep learning frameworks, you can view a relay.Function as a computational graph with additional support for control flow, recursion, and complex data structures.
+- **tir::PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly fused) layer in a model.
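+
+As a quick illustration of these two layers, the sketch below builds a tiny relay.Function and wraps it in an IRModule; the shape is arbitrary and `IRModule.from_expr` is just one convenient constructor:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # a tiny relay.Function: y = relu(x)
+    x = relay.var("x", shape=(1, 8), dtype="float32")
+    func = relay.Function([x], relay.nn.relu(x))
+    # wrap the function into an IRModule so it can be transformed and built
+    mod = tvm.IRModule.from_expr(func)
+    print(mod)  # the module now contains a "main" function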
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation can serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
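+
+For example, a few of these passes can be composed and applied to the `mod` from the earlier sketch roughly as follows; the pass selection and options here are illustrative, not a prescribed pipeline:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # compose a few relay optimization passes into a single pipeline
+    seq = tvm.transform.Sequential([
+        relay.transform.FoldConstant(),
+        relay.transform.DeadCodeElimination(),
+        relay.transform.FuseOps(fuse_opt_level=2),
+    ])
+    with tvm.transform.PassContext(opt_level=3):
+        mod = seq(mod)  # returns a new, functionally equivalent IRModule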
+
+Near the end of the relay optimization pipeline, we run a pass (FuseOps) to break the end-to-end function (e.g. mobilenet) into sub-function (e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-function. For specific targets, we may also go directly to the target translation phase and use external code generators.
+
+There are a few different ways (in relay/backend) to handle the overall execution structure. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead-of-time compilation that compiles the high-level execution structure into the executable together with the generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR-level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access into one-dimensional pointer access, to expand intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimization passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimization for different hardware platforms. To do so, we need to investigate as many optimization choices as possible, including but not limited to multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of these choices. Instead, we take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, and vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
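+
+The sketch below applies two such scheduling primitives by hand through the te API; a search procedure explores many such combinations automatically, and the split factor here is an arbitrary illustrative choice:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
+
+    s = te.create_schedule(B.op)
+    # each primitive is one "action" in the optimization search space
+    xo, xi = s[B].split(B.op.axis[0], factor=32)  # loop tiling choice
+    s[B].vectorize(xi)                            # vectorization choice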
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just look up the best schedule sequence and apply it to the program. Notably, this schedule application phase behaves **exactly like** the rule-based transformations, enabling us to share the same interface convention with traditional passes.
+
+We use search-based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM (auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86 and ARM, we use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we support direct translation of a relay function (sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowering before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
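+
+As a minimal sketch of this phase, the snippet below builds a small tensor-expression function for the llvm target and exports the resulting runtime.Module; the function and file names are arbitrary:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
+    s = te.create_schedule(B.op)
+
+    # target translation: schedule + target -> runtime.Module
+    fadd = tvm.build(s, [A, B], target="llvm", name="addone")
+    fadd.export_library("addone_lib.so")
+
+The exported library can then be loaded back through the runtime interface described in the next section.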
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimal set of APIs that allow a user to load and execute the compiled artifact in a language of their choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with types annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(arr)
+    print(arr.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for both compiler-generated functions and runtime-provided functions. A runtime.PackedFunc can take arguments and return values with the following types: POD types (int, float), string, runtime.PackedFunc, runtime.Module, runtime.NDArray, and subclasses of runtime.Object.
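+
+The sketch below illustrates the type-erased calling convention from python by registering an ordinary python function as a global PackedFunc and calling it back; the registered name is arbitrary:
+
+.. code-block:: python
+
+    import tvm
+
+    @tvm.register_func("example.add_scalars")
+    def add_scalars(x, y):
+        return x + y
+
+    # look the function up through the type-erased interface and call it
+    fadd = tvm.get_global_func("example.add_scalars")
+    assert fadd(1, 2) == 3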
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code that computes the launching parameters (e.g. the size of the thread groups) and then calls into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet below gives an example of end-to-end model execution using the same interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with types annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main takeaway is that runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator-level programs (such as addone) and end-to-end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most parts of the compilation are transformations among the key data structures.
+
+- relay/transform and tir/transform are deterministic rule-based transformations
+- auto_scheduler and autotvm contain the search-based transformations
+
+Finally, the compilation flow example is only a typical use-case of the TVM stack. We expose these key data structures and transformations to python and C++ APIs. As a result, you can use TVM just as you would use numpy, except that the data structure of interest changes from numpy.ndarray to tvm.IRModule. Here are some example use-cases:
+
+- Directly construct an IRModule using hybrid script for compilation.
+- Compose a custom set of transformations (e.g. customized quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, such as a generic arena allocator, sockets, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. The runtime defines a stable, standard set of C APIs to interface with frontend languages such as python and rust.
+
+`runtime::Object` is one of the primary data structures in the TVM runtime besides `runtime::PackedFunc`. It is a reference-counted base class with a type index to support runtime type checking and downcasting. The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of TVM's runtime mechanism. All of the IR data structures are subclasses of `runtime::Object`; as a result, they can be directly accessed and manipulated from the python frontend. We use the PackedFunc mechanism to expose various APIs to the frontend.
+
+Runtime support for different hardware backends is defined in subdirectories of runtime (e.g. runtime/opencl). These hardware-specific runtime modules define APIs for device memory allocation and device function serialization.
+
+`runtime/rpc` implements RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote device and benchmark the execution performance. The rpc infrastructure enables data collection from a wide range of hardware backends for learning-based optimizations.
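+
+A typical RPC session looks roughly like the sketch below; the address, port, and file name are placeholders:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import rpc
+
+    # connect to a device that is running an RPC server (placeholder address)
+    remote = rpc.connect("192.168.0.42", 9090)
+    # upload a cross-compiled library and load it on the remote device
+    remote.upload("addone_lib.so")
+    rlib = remote.load_module("addone_lib.so")
+    # rlib behaves like a local runtime.Module: look up and call PackedFuncs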
 
-Building a compiler stack for deep learning systems involves many many systems-level design decisions.
-In this part of documentation, we share the rationale for the specific choices made when designing TVM.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    runtime
    debugger
+   virtual_machine
+   introduction_to_module_serialization
+
+tvm/node
+--------
+The node module adds additional features on top of `runtime::Object` for IR data structures.
+The main features include reflection, serialization, structural equivalence, and hashing.
+
+Thanks to the node module, we can directly access any field of TVM's IR nodes by name in python.
+
+.. code-block:: python
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # we can directly use the field name to access the IR structures
+    assert y.a == x
+
+
+We can also directly serialize arbitrary IR nodes into a JSON format and load them back.
+The ability to save, load, and inspect an IR node provides a foundation for making the compiler more accessible.
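+
+For example, continuing the snippet above, we can round-trip the `y` node through JSON and check structural equivalence:
+
+.. code-block:: python
+
+    json_str = tvm.ir.save_json(y)
+    y_loaded = tvm.ir.load_json(json_str)
+    assert tvm.ir.structural_equal(y, y_loaded)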
+
+
+tvm/ir
+------
+The `tvm/ir` folder contains the unified data structures and interfaces for all IR function variants.
+The components in `tvm/ir` are shared by `tvm/relay` and `tvm/tir`; notable ones include:
+
+- IRModule
+- Type
+- PassContext and Pass
+- Op
+
+Different variants of functions (e.g. relay.Function and tir.PrimFunc) can co-exist in an IRModule.
+While these variants may not have the same content representation, they use the same data structure to represent their types
+and, as a consequence, their function signatures. The unified type system allows one function variant to call into another
+once we clearly define the calling convention, and it opens doors for future cross-function-variant optimizations.
+
+We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline.
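+
+For instance, assuming an IRModule `mod` as in the earlier sketches, a pass can be run under a configured PassContext roughly as follows (the disabled pass name is only an example):
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
+        # passes invoked here see the surrounding PassContext configuration
+        mod = relay.transform.FoldConstant()(mod)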
+
+Op is the common class used to represent all system-defined primitive operators/intrinsics.
+Developers can register new Ops, as well as their additional attributes (e.g. whether the Op is elementwise), to the system.
+
+
+tvm/target
+----------
+The target module contains all the code generators that translate an IRModule to a target runtime.Module.
+It also provides a common `Target` class that describes the target.
+
+The compilation pipeline can be customized according to the target by querying the attribute information
+in the target and the builtin information registered to each target id (e.g. cuda, opencl).
+
+tvm/tir
+-------
+
+TIR contains the definition for the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes.
+Besides the IR data structures, the tir module also defines a set of builtin intrinsics and their attributes via the common Op registry, as well as transformation passes in `tir/transform`.
+
+tvm/arith
+---------
+
+This module is closely tied to the TIR. One of the key problems in low-level code generation is the analysis of the indices'
+arithmetic properties — their positiveness, variable bounds, and the integer set that describes the iterator space. The arith module provides
+a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code.
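+
+A small sketch of the python-side analyzer interface (the bound values are arbitrary):
+
+.. code-block:: python
+
+    import tvm
+
+    analyzer = tvm.arith.Analyzer()
+    x = tvm.tir.Var("x", "int32")
+    # tell the analyzer what we know about x
+    analyzer.update(x, tvm.arith.ConstIntBound(0, 127))
+    # query the derived bound of an index expression
+    print(analyzer.const_int_bound(x + 1))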
+
+tvm/te
+------
+
+The name te stands for "tensor expression". This is a domain-specific language module that allows us to construct `tir::PrimFunc` variants quickly by writing tensor expressions.
+Importantly, a tensor expression itself is not a self-contained function that can be stored in an IRModule. Instead, it is a fragment of IR that we can stitch together to build an IRModule.
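+
+For example, the sketch below stitches a small tensor expression into TIR via `tvm.lower`; the shapes and names are arbitrary:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+
+    A = te.placeholder((128,), name="A")
+    B = te.compute((128,), lambda i: A[i] * 2.0, name="B")
+    s = te.create_schedule(B.op)
+    # stitch the te fragment into TIR; the result can be inspected or built
+    lowered = tvm.lower(s, [A, B], name="double")
+    print(lowered)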
+
+`te/schedule` provides a collection of scheduling primitives to control the function being generated. In the future, we might bring some of
+these scheduling components into the `tir::PrimFunc` itself.
+
+.. toctree::
+   :maxdepth: 1
+
+   inferbound
    hybrid_script
+
+topi
+----
+While we can build different kinds of operators directly via tir or tensor expression.
+It is still tedious to do so. `topi` provides a set of pre-defined operators (in tensor expression or tir) in

Review comment:
       ```suggestion
   `topi` provides a set of pre-defined operators (in TE or TIR) defined by 
   ```

##########
File path: docs/dev/index.rst
##########
+
+topi
+----
+While we can build different kinds of operators directly via tir or tensor expression.
+It is still tedious to do so. `topi` provides a set of pre-defined operators (in tensor expression or tir) in
+numpy and common deep learning workloads. We also provide a collection of common scheduling templates to schedule those operators on different target platforms.

Review comment:
       ```suggestion
   numpy and found in common deep learning workloads. We also provide a collection of common schedule templates to obtain performant implementations across different target platforms.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent(e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and APIs that transform these data structures. One we identified the key data structures, we can then de-couple a system into logical components that either define a collection of key data structures or transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end to end model. You can view a relay.Function as a computational graph with additional support for control-flow, recursion, and complex data structures, if you are familiar with the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) to break the end to end function(e.g. mobilenet) into sub-function(e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead of time compilation that compiles the high-level execution structure into the executable and generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand the intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimizations passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimizations for different hardware platforms. To do so, we will need to investigate as many optimizations choices as possible, including but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system will use then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just lookup the best schedule sequence and apply it to the program. Notably, this schedule application phase **exactly like** the rule-based transformations, enabling us to share the same interface convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM(auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86, ARM, we will use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we also support the direct translation of a relay function(sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowerings before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimum set of APIs to allow a user to load and execute the compiled artifact in their language of choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with type annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(a)
+    print(a.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for both the generated functions. A runtime.PackedFunc can take arguments and return values with the following types: POD types(int, float), string, runtime.PackedFunc, runtime.Module, runtime.NDArray, sub-classes of runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code to compute the launching parameters(e.g. size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet below gives an example of an end to end model execution using the same interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with type annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main take away is that the runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator level programs(such as addone), as well as the end to end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most of part of the compilation are transformations among the key data structures.
+
+- relay/transform and tir/transform are determinstic rule-based transformations
+- auto_scheduler and autotvm contains the search-based transformations
+
+Finally, the compilation flow example is only a typical use-case of the TVM stack. We expose these key data structures and transformations to python and C++ APIs. As a result, you can use TVM just like the way you use numpy, except that the data structure of interest changes from the numpy.ndarray to tvm.IRModule. Here are some example use-cases:
+
+- Directly construct IRModule using hybrid script for compilations.
+- Compose a custom set of transformations(e.g. customize quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, such as generic arena allocator, socket, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. The runtime defines a stable standard set of C API to interface with frontend languages such as python and rust.
+
+`runtime::Object` is one of the primary data structures in TVM runtime besides the `runtime::PackedFunc`. It is a reference-counted base class with type index to support runtime type checking and downcasting. The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of the tvm's runtime mechanism. All of the IR data structures are subclasses of `runtime::Object`, as a result, they can be directly accessed and manipulated from the python frontend. We expose the PackedFunc to expose various APIs to the frontend.
+
+Runtime support for different hardware backends are defined in subdirectories of runtime(e.g. runtime/opencl). These hardware-specific runtime modules define APIs for device memory allocation and device function serialization.
+
+`runtime/rpc` implements an RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote device and benchmark the execution performance. The rpc infrastructure enables data collection from a wide range of hardware backends for learning-based optimizations.
 
-Building a compiler stack for deep learning systems involves many many systems-level design decisions.
-In this part of documentation, we share the rationale for the specific choices made when designing TVM.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    runtime
    debugger
+   virtual_machine
+   introduction_to_module_serialization
+
+tvm/node
+--------
+The node module adds additional features on top of the `runtime::Object` for IR data structures.
+The main features include reflection, serialization, structural equivalence, and hashing.
+
+Thanks to the node module, we can directly access any field of the tvm's IRNode by their name in python.
+
+.. code-block:: python
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # we can directly use the field name to access the IR structures
+    assert y.a == x
+
+
+We can also directly serialize arbitrary IR node into a JSON format, and load them back.
+The ability to save/store, and inspect an IR node provides a foundation for making the compiler more accessible.
+
+
+tvm/ir
+------
+The `tvm/ir` folder contains the unified data structure and interfaces across for all IR function variants.
+The components in `tvm/ir` are shared by `tvm/relay` and `tvm/tir`, notable ones include
+
+- IRModule
+- Type
+- PassContext and Pass
+- Op
+
+Different variants of functions (e.g. relay.Function and tir.PrimFunc) can co-exist in an IRModule.
+While these variants may not have the same content representation, they use the same data structure to represent types
+and, as a consequence, their function signatures. The unified type system allows one function variant to call into another
+once we clearly define the calling convention, and opens the door to future cross-function-variant optimizations.
+
+We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline.
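+
+The sketch below (assuming `tvm.transform.Sequential`, `tvm.transform.PassContext`, and the listed Relay passes) runs a small pipeline under an explicit configuration:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # build a tiny Relay module to transform
+    x = relay.var("x", shape=(2, 2))
+    f = relay.Function([x], x + relay.const(1.0) + relay.const(2.0))
+    mod = tvm.IRModule.from_expr(f)
+
+    # compose passes into a pipeline and run it under a PassContext
+    seq = tvm.transform.Sequential([
+        relay.transform.FoldConstant(),
+        relay.transform.DeadCodeElimination(),
+    ])
+    with tvm.transform.PassContext(opt_level=3):
+        mod = seq(mod)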
+
+Op is the common class used to represent all system-defined primitive operators/intrinsics.
+Developers can register new Ops as well as their additional attributes (e.g. whether the Op is elementwise) to the system.
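+
+A minimal sketch (assuming the Relay operator registry binding) of looking up a registered Op and inspecting its metadata; the registration APIs for adding new Ops and attributes are omitted here:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # look up a primitive operator registered to the system
+    op = relay.op.get("nn.relu")
+    print(op.name)        # "nn.relu"
+    print(op.num_inputs)  # 1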
+
+
+tvm/target
+----------
+The target module contains all the code generators that translate an IRModule to a target runtime.Module.
+It also provides a common `Target` class that describes the target.
+
+The compilation pipeline can be customized according to the target by querying the attribute information
+in the target and the builtin information registered to each target id (e.g. cuda, opencl).
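+
+A minimal sketch (assuming the `tvm.target.cuda()` helper) of creating a Target whose attributes downstream passes can query:
+
+.. code-block:: python
+
+    import tvm
+
+    # create a target description for CUDA; passes can query its attributes,
+    # e.g. the maximum number of threads per block, to guide code generation
+    target = tvm.target.cuda()
+    print(target)
+    print(target.max_num_threads)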
+
+tvm/tir
+-------
+
+TIR contains the definitions of the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes.
+Besides the IR data structures, the tir module also defines a set of builtin intrinsics and their attributes via the common Op registry, as well as transformation passes in `tir/transform`.
+
+tvm/arith
+---------
+
+This module is closely tied to TIR. One of the key problems in low-level code generation is the analysis of the indices'
+arithmetic properties: positiveness, variable bounds, and the integer sets that describe the iterator space. The arith module provides
+a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code.
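+
+A small sketch (assuming the `tvm.arith.Analyzer` binding) of the kind of integer analysis TIR passes rely on:
+
+.. code-block:: python
+
+    import tvm
+
+    analyzer = tvm.arith.Analyzer()
+    x = tvm.tir.Var("x", "int32")
+    # simplify an index expression
+    print(analyzer.simplify(x * 4 + 4 - x * 4))
+    # bound the value range of an expression, given a bound on x
+    analyzer.update(x, tvm.arith.ConstIntBound(0, 10))
+    print(analyzer.const_int_bound(x + 1))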
+
+tvm/te
+------
+
+The name te stands for "tensor expression". This is a domain-specific language module that allows us to construct `tir::PrimFunc` variants quickly by writing tensor expressions.
+Importantly, a tensor expression itself is not a self-contained function that can be stored in an IRModule. Instead, it is a fragment of IR that we can stitch together to build an IRModule.
+
+`te/schedule` provides a collection of scheduling primitives to control the function being generated. In the future, we might bring some of
+these scheduling components into `tir::PrimFunc` itself.
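+
+A minimal sketch of building a vector-add with tensor expressions, scheduling it, and lowering it into an IRModule that contains a `tir.PrimFunc` (assuming `tvm.lower` returns an IRModule in this version):
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.placeholder((n,), name="B")
+    # declare the computation as a tensor expression fragment
+    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
+    # create a schedule; scheduling primitives (split, tile, bind, ...) go here
+    s = te.create_schedule(C.op)
+    # lower the scheduled computation into an IRModule of tir.PrimFuncs
+    mod = tvm.lower(s, [A, B, C], name="vector_add")
+    print(mod)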
+
+.. toctree::
+   :maxdepth: 1
+
+   inferbound
    hybrid_script
+
+topi
+----
+While we can construct operators directly via TIR or tensor expressions, doing so is tedious.
+`topi` (TVM operator inventory) provides a set of pre-defined operators (defined in tensor expression or TIR) for
+operations commonly found in numpy and deep learning workloads. We also provide a collection of common scheduling templates to schedule these operators on different target platforms.
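+
+A small sketch (assuming the standalone `topi` package) of reusing a pre-defined operator instead of writing the tensor expression by hand:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+    import topi
+
+    n = te.var("n")
+    A = te.placeholder((n, n), name="A")
+    # reuse a numpy-style operator definition from topi
+    B = topi.nn.relu(A)
+    # fall back to a default schedule; targets can provide tuned schedules instead
+    s = te.create_schedule(B.op)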
+
+
+tvm/relay
+---------
+Relay is the high-level functional IR used to represent the end to end models. Various optimizations are supported in relay/transform. There are multiple dialects in relay to support specific perspectives of high-level optimization. Notably ones include qnn(for importing pre-quantized models), vm(for lowering to dynamic vm), memory(for memory optimization).

Review comment:
       ```suggestion
   Relay is the high-level functional IR used to represent full models. Various optimizations are defined in `relay.transform`. The Relay compiler defines multiple dialects, and each dialect is designed to support specific styles of optimization. Notable ones include QNN (for importing pre-quantized models), VM (for lowering to the dynamic VM), and memory (for memory optimization).
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent(e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and APIs that transform these data structures. One we identified the key data structures, we can then de-couple a system into logical components that either define a collection of key data structures or transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end to end model. You can view a relay.Function as a computational graph with additional support for control-flow, recursion, and complex data structures, if you are familiar with the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) to break the end to end function(e.g. mobilenet) into sub-function(e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead of time compilation that compiles the high-level execution structure into the executable and generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand the intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimizations passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimizations for different hardware platforms. To do so, we will need to investigate as many optimizations choices as possible, including but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system will use then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just lookup the best schedule sequence and apply it to the program. Notably, this schedule application phase **exactly like** the rule-based transformations, enabling us to share the same interface convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM(auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+

Review comment:
       ```suggestion
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.

Review comment:
       ```suggestion
   First, we review a single end to end compilation flow and discuss the key data structures and the transformations.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent(e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and APIs that transform these data structures. One we identified the key data structures, we can then de-couple a system into logical components that either define a collection of key data structures or transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end to end model. You can view a relay.Function as a computational graph with additional support for control-flow, recursion, and complex data structures, if you are familiar with the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) to break the end to end function(e.g. mobilenet) into sub-function(e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead of time compilation that compiles the high-level execution structure into the executable and generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand the intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimizations passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimizations for different hardware platforms. To do so, we will need to investigate as many optimizations choices as possible, including but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system will use then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just lookup the best schedule sequence and apply it to the program. Notably, this schedule application phase **exactly like** the rule-based transformations, enabling us to share the same interface convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM(auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86, ARM, we will use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we also support the direct translation of a relay function(sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowerings before target translation.

Review comment:
       ```suggestion
   The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86 and ARM, we use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we support the direct translation of a Relay function (sub-graph) for external code generators. Importantly, the final code generation phase should be as lightweight as possible, with the vast majority of transformations and lowering performed before target translation.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.

Review comment:
       ```suggestion
   Then we will review the logical modules of the codebase and their relationship. This part provides a static overarching view of the design.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.

Review comment:
       ```suggestion
   - The `Example Compilation Flow Walkthrough`_ section is a walkthrough of a typical compilation flow that explains each component used during compilation.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.

Review comment:
       ```suggestion
   - Import: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent(e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and APIs that transform these data structures. One we identified the key data structures, we can then de-couple a system into logical components that either define a collection of key data structures or transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end to end model. You can view a relay.Function as a computational graph with additional support for control-flow, recursion, and complex data structures, if you are familiar with the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) to break the end to end function(e.g. mobilenet) into sub-function(e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead of time compilation that compiles the high-level execution structure into the executable and generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand the intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimizations passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimizations for different hardware platforms. To do so, we will need to investigate as many optimizations choices as possible, including but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system will use then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just lookup the best schedule sequence and apply it to the program. Notably, this schedule application phase **exactly like** the rule-based transformations, enabling us to share the same interface convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM(auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86, ARM, we will use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we also support the direct translation of a relay function(sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowerings before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimum set of APIs to allow a user to load and execute the compiled artifact in their language of choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:

Review comment:
       ```suggestion
   The main goal of TVM's runtime is to provide a minimal API for loading and executing the compiled artifact in a language of the user's choice, including Python, C++, Rust, Go, Java, and JavaScript. The code snippet below shows such an example in Python:
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized

Review comment:
       ```suggestion
     The sections after are specific guides focused on each logical component, organized
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.

Review comment:
       ```suggestion
   This runtime-based view focuses on the interactions of each component when running the compiler.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent(e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and APIs that transform these data structures. One we identified the key data structures, we can then de-couple a system into logical components that either define a collection of key data structures or transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end to end model. You can view a relay.Function as a computational graph with additional support for control-flow, recursion, and complex data structures, if you are familiar with the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) to break the end to end function(e.g. mobilenet) into sub-function(e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead of time compilation that compiles the high-level execution structure into the executable and generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand the intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimizations passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimizations for different hardware platforms. To do so, we will need to investigate as many optimizations choices as possible, including but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system will use then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just lookup the best schedule sequence and apply it to the program. Notably, this schedule application phase **exactly like** the rule-based transformations, enabling us to share the same interface convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM(auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86, ARM, we will use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we also support the direct translation of a relay function(sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowerings before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimum set of APIs to allow a user to load and execute the compiled artifact in their language of choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with type annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(a)
+    print(a.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for both the generated functions. A runtime.PackedFunc can take arguments and return values with the following types: POD types(int, float), string, runtime.PackedFunc, runtime.Module, runtime.NDArray, sub-classes of runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code to compute the launching parameters(e.g. size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet below gives an example of an end to end model execution using the same interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with type annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main take away is that the runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator level programs(such as addone), as well as the end to end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most of part of the compilation are transformations among the key data structures.
+
+- relay/transform and tir/transform are determinstic rule-based transformations
+- auto_scheduler and autotvm contains the search-based transformations
+
+Finally, the compilation flow example is only a typical use-case of the TVM stack. We expose these key data structures and transformations to python and C++ APIs. As a result, you can use TVM just like the way you use numpy, except that the data structure of interest changes from the numpy.ndarray to tvm.IRModule. Here are some example use-cases:
+
+- Directly construct IRModule using hybrid script for compilations.
+- Compose a custom set of transformations(e.g. customize quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, such as generic arena allocator, socket, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. The runtime defines a stable standard set of C API to interface with frontend languages such as python and rust.
+
+`runtime::Object` is one of the primary data structures in TVM runtime besides the `runtime::PackedFunc`. It is a reference-counted base class with type index to support runtime type checking and downcasting. The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of the tvm's runtime mechanism. All of the IR data structures are subclasses of `runtime::Object`, as a result, they can be directly accessed and manipulated from the python frontend. We expose the PackedFunc to expose various APIs to the frontend.
+
+Runtime support for different hardware backends are defined in subdirectories of runtime(e.g. runtime/opencl). These hardware-specific runtime modules define APIs for device memory allocation and device function serialization.
+
+`runtime/rpc` implements an RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote device and benchmark the execution performance. The rpc infrastructure enables data collection from a wide range of hardware backends for learning-based optimizations.
 
-Building a compiler stack for deep learning systems involves many many systems-level design decisions.
-In this part of documentation, we share the rationale for the specific choices made when designing TVM.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    runtime
    debugger
+   virtual_machine
+   introduction_to_module_serialization
+
+tvm/node
+--------
+The node module adds additional features on top of the `runtime::Object` for IR data structures.
+The main features include reflection, serialization, structural equivalence, and hashing.
+
+Thanks to the node module, we can directly access any field of the tvm's IRNode by their name in python.
+
+.. code-block:: python
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # we can directly use the field name to access the IR structures
+    assert y.a == x
+
+
+We can also directly serialize arbitrary IR node into a JSON format, and load them back.
+The ability to save/store, and inspect an IR node provides a foundation for making the compiler more accessible.
+
+
+tvm/ir
+------
+The `tvm/ir` folder contains the unified data structure and interfaces across for all IR function variants.
+The components in `tvm/ir` are shared by `tvm/relay` and `tvm/tir`, notable ones include
+
+- IRModule
+- Type
+- PassContext and Pass
+- Op
+
+Different variants of functions(e.g. relay.Function and tir.PrimFunc) can co-exist in an IRModule.
+While these variants may not have the same content representation, they use the same data structure to represent types,
+and as a consequence, their function signatures. The unified type system allows one function variant to call into another
+once we clearly define the calling convention and opens doors for future cross-function-variant optimizations.
+
+We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline.
+
+Op is the common class to represent all system-defined primitive operator/intrinsics.
+Developers can register new Ops as well as their additional attributes(e.g. whether the Op is elementwise) to the system.
+
+
+tvm/target
+----------
+The target module contains all the code generators that translate an IRModule to a target runtime.Module.
+It also provides a common `Target` class that describes the target.
+
+The compilation pipeline can be customized according to the target by querying the attribute information
+in the target and builtin information registered to each target id(cuda, opencl).
+
+tvm/tir
+-------
+
+TIR contains the definition for the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes.
+Besides the IR data structures, the tir module also defines a set of builtin intrinsics and their attributes via the common Op registry, as well as transformation passes in `tir/transform`.
+
+tvm/arith
+---------
+
+This module is closely tied to TIR. One of the key problems in low-level code generation is the analysis of the indices'
+arithmetic properties — positiveness, variable bounds, and the integer sets that describe the iterator space. The arith module provides
+a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code.
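+
+A short sketch of the analyzer interface (the bound values here are only illustrative):
+
+.. code-block:: python
+
+    import tvm
+
+    analyzer = tvm.arith.Analyzer()
+    x = tvm.tir.Var("x", "int32")
+    # tell the analyzer that 0 <= x <= 10
+    analyzer.update(x, tvm.arith.ConstIntBound(0, 10))
+    # query the derived bound of an expression
+    print(analyzer.const_int_bound(x + 1))
+    # simplify an expression using the accumulated knowledge
+    print(analyzer.simplify(x + 1 - x))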
+
+tvm/te
+------
+
+The name te stands for "tensor expression". This is a domain-specific language module that allows us to construct `tir::PrimFunc` variants quickly by writing tensor expressions.
+Importantly, a tensor expression itself is not a self-contained function that can be stored into an IRModule. Instead, it is a fragment of IR that we can stitch together to build an IRModule.
+
+`te/schedule` provides a collection of scheduling primitives to control the function being generated. In the future, we might bring some of
+these scheduling components into the `tir::PrimFunc` itself.
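+
+A minimal sketch of going from a tensor expression to a `tir::PrimFunc` inside an IRModule (in recent versions `tvm.lower` returns an IRModule):
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+
+    A = te.placeholder((128,), name="A")
+    B = te.compute((128,), lambda i: A[i] + 1.0, name="B")
+    s = te.create_schedule(B.op)
+    # stitch the tensor expression fragment into an IRModule
+    mod = tvm.lower(s, [A, B])
+    print(mod)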
+
+.. toctree::
+   :maxdepth: 1
+
+   inferbound
    hybrid_script
+
+topi
+----
+While it is possible to construct operators directly via tir or tensor expressions, it is tedious to do so.
+`topi` (TVM Operator Inventory) provides a set of pre-defined operators (in tensor expression or tir) defined by
+numpy and found in common deep learning workloads. We also provide a collection of common schedule templates to schedule these operators on different target platforms.
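+
+A small sketch (assuming the standalone `topi` package that ships with TVM; it may live under `tvm.topi` in later versions):
+
+.. code-block:: python
+
+    import topi
+    from tvm import te
+
+    A = te.placeholder((64, 64), name="A")
+    # pre-defined operators expressed as tensor expressions
+    B = topi.nn.relu(A)
+    C = topi.sum(B, axis=1)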
+
+
+tvm/relay
+---------
+Relay is the high-level functional IR used to represent full, end-to-end models. Various optimizations are supported in relay/transform. Relay also contains multiple dialects that support specific styles of high-level optimization; notable ones include qnn (for importing pre-quantized models), vm (for lowering to the dynamic virtual machine), and memory (for memory optimization).
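+
+For instance, here is a minimal sketch of building a Relay function and applying a relay/transform pass:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    x = relay.var("x", shape=(1, 16), dtype="float32")
+    f = relay.Function([x], relay.nn.relu(x))
+    mod = tvm.IRModule.from_expr(f)
+    # run a pass from relay/transform on the module
+    mod = relay.transform.InferType()(mod)
+    print(mod)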
+
+.. toctree::
+   :maxdepth: 1
+
    relay_intro
-   relay_add_op
    relay_op_strategy
    relay_pass_infra
+   convert_layout
+
+
+tvm/autotvm
+-----------
+
+AutoTVM and Autoscheduler are components for automating the search based transformations. This part of the codebase is fast evolving with the following components:
+
+- Cost model and feature extraction.

Review comment:
       ```suggestion
   - Cost models and feature extraction.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
+tvm/autotvm
+-----------
+
+AutoTVM and Autoscheduler are components for automating the search based transformations. This part of the codebase is fast evolving with the following components:
+
+- Cost model and feature extraction.
+- Logging format to store the benchmark result.

Review comment:
       ```suggestion
   - A record format for storing program benchmark results for cost model construction. 
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
+tvm/autotvm
+-----------
+
+AutoTVM and Autoscheduler are components for automating the search based transformations. This part of the codebase is fast evolving with the following components:

Review comment:
       ```suggestion
   AutoTVM and AutoScheduler are both components which automate search based program optimization. This is rapidly evolving and primarily consists of: 
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example of end to end compilation flow and discuss the key data structures and the transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent(e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and APIs that transform these data structures. One we identified the key data structures, we can then de-couple a system into logical components that either define a collection of key data structures or transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end to end model. You can view a relay.Function as a computational graph with additional support for control-flow, recursion, and complex data structures, if you are familiar with the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) to break the end to end function(e.g. mobilenet) into sub-function(e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic executions. Finally, we plan to support ahead of time compilation that compiles the high-level execution structure into the executable and generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the latter part of the guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand the intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimizations passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, CUDA C, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimizations for different hardware platforms. To do so, we will need to investigate as many optimizations choices as possible, including but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can make to a program. The system will use then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just lookup the best schedule sequence and apply it to the program. Notably, this schedule application phase **exactly like** the rule-based transformations, enabling us to share the same interface convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM(auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86, ARM, we will use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we also support the direct translation of a relay function(sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowerings before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimum set of APIs to allow a user to load and execute the compiled artifact in their language of choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with type annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(a)
+    print(a.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for both the generated functions. A runtime.PackedFunc can take arguments and return values with the following types: POD types(int, float), string, runtime.PackedFunc, runtime.Module, runtime.NDArray, sub-classes of runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code to compute the launching parameters(e.g. size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet below gives an example of an end to end model execution using the same interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with type annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main take away is that the runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator level programs(such as addone), as well as the end to end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most of part of the compilation are transformations among the key data structures.
+
+- relay/transform and tir/transform are determinstic rule-based transformations
+- auto_scheduler and autotvm contains the search-based transformations
+
+Finally, the compilation flow example is only a typical use-case of the TVM stack. We expose these key data structures and transformations to python and C++ APIs. As a result, you can use TVM just like the way you use numpy, except that the data structure of interest changes from the numpy.ndarray to tvm.IRModule. Here are some example use-cases:
+
+- Directly construct IRModule using hybrid script for compilations.
+- Compose a custom set of transformations(e.g. customize quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, such as generic arena allocator, socket, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. The runtime defines a stable standard set of C API to interface with frontend languages such as python and rust.
+
+`runtime::Object` is one of the primary data structures in TVM runtime besides the `runtime::PackedFunc`. It is a reference-counted base class with type index to support runtime type checking and downcasting. The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of the tvm's runtime mechanism. All of the IR data structures are subclasses of `runtime::Object`, as a result, they can be directly accessed and manipulated from the python frontend. We expose the PackedFunc to expose various APIs to the frontend.
+
+Runtime support for different hardware backends are defined in subdirectories of runtime(e.g. runtime/opencl). These hardware-specific runtime modules define APIs for device memory allocation and device function serialization.
+
+`runtime/rpc` implements an RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote device and benchmark the execution performance. The rpc infrastructure enables data collection from a wide range of hardware backends for learning-based optimizations.
 
-Building a compiler stack for deep learning systems involves many many systems-level design decisions.
-In this part of documentation, we share the rationale for the specific choices made when designing TVM.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    runtime
    debugger
+   virtual_machine
+   introduction_to_module_serialization
+
+tvm/node
+--------
+The node module adds additional features on top of `runtime::Object` for IR data structures.
+The main features include reflection, serialization, structural equivalence, and hashing.
+
+Thanks to the node module, we can directly access any field of TVM's IR nodes by name in python.
+
+.. code-block:: python
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # we can directly use the field name to access the IR structures
+    assert y.a == x
+
+
+We can also directly serialize arbitrary IR nodes into a JSON format and load them back.
+The ability to save, load, and inspect an IR node provides a foundation for making the compiler more accessible.
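+
+The sketch below shows such a round trip, assuming the `tvm.ir.save_json` and `tvm.ir.load_json` helpers:
+
+.. code-block:: python
+
+    import tvm
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # Serialize the IR node into JSON and load it back.
+    json_str = tvm.ir.save_json(y)
+    y_loaded = tvm.ir.load_json(json_str)
+    # The loaded node is structurally equal to the original one.
+    assert tvm.ir.structural_equal(y, y_loaded)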
+
+
+tvm/ir
+------
+The `tvm/ir` folder contains the unified data structures and interfaces for all IR function variants.
+The components in `tvm/ir` are shared by `tvm/relay` and `tvm/tir`; notable ones include
+
+- IRModule
+- Type
+- PassContext and Pass
+- Op
+
+Different variants of functions (e.g. relay.Function and tir.PrimFunc) can co-exist in an IRModule.
+While these variants may not have the same content representation, they use the same data structures to represent types,
+and as a consequence, their function signatures. The unified type system allows one function variant to call into another
+once we clearly define the calling convention, and it opens the door for future cross-function-variant optimizations.
+
+We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline.
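+
+As a rough sketch, a pass pipeline can be configured and applied as follows; the specific passes are illustrative, and `mod` is assumed to be an existing IRModule:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # Compose several passes into a single pipeline.
+    seq = tvm.transform.Sequential([
+        relay.transform.FoldConstant(),
+        relay.transform.EliminateCommonSubexpr(),
+    ])
+    # PassContext controls options such as the optimization level.
+    with tvm.transform.PassContext(opt_level=3):
+        mod = seq(mod)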
+
+Op is the common class to represent all system-defined primitive operator/intrinsics.
+Developers can register new Ops as well as their additional attributes (e.g. whether the Op is elementwise) to the system.
+
+
+tvm/target
+----------
+The target module contains all the code generators that translate an IRModule to a target runtime.Module.
+It also provides a common `Target` class that describes the target.
+
+The compilation pipeline can be customized according to the target by querying the attribute information
+in the target and the builtin information registered to each target id (e.g. cuda, opencl).
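+
+For illustration, a target can be created and its attributes queried roughly as follows (attribute names can vary across versions):
+
+.. code-block:: python
+
+    import tvm
+
+    # Create a CUDA target description.
+    target = tvm.target.cuda()
+    # Attributes such as the maximum number of threads can guide
+    # target-dependent transformations before code generation.
+    print(target.max_num_threads)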
+
+tvm/tir
+-------
+
+TIR contains the definitions of the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes.
+Besides the IR data structures, the tir module also defines a set of builtin intrinsics and their attributes via the common Op registry, as well as transformation passes in `tir/transform`.
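+
+For example (a sketch in which `mod` is assumed to be an IRModule that already contains tir.PrimFuncs), a tir pass is applied like any other IRModule-to-IRModule transformation:
+
+.. code-block:: python
+
+    import tvm
+
+    # Apply a tir transformation pass to every PrimFunc in the module.
+    mod = tvm.tir.transform.Simplify()(mod)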
+
+tvm/arith
+---------
+
+This module is closely tied to TIR. One of the key problems in low-level code generation is the analysis of the indices'
+arithmetic properties — the positiveness, variable bounds, and the integer sets that describe the iterator space. The arith module provides
+a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code.
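+
+A small sketch of how a pass might use these analyses:
+
+.. code-block:: python
+
+    import tvm
+
+    analyzer = tvm.arith.Analyzer()
+    x = tvm.tir.Var("x", "int32")
+    # Tell the analyzer that x lies in [0, 127].
+    analyzer.update(x, tvm.arith.ConstIntBound(0, 127))
+    # With the bound known, expressions such as floormod(x, 256) can often be simplified.
+    print(analyzer.simplify(tvm.tir.floormod(x, 256)))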
+
+tvm/te
+------
+
+The name te stands for "tensor expression". This is a domain-specific language module that allows us to construct `tir::PrimFunc` variants quickly by writing tensor expressions.
+Importantly, a tensor expression itself is not a self-contained function that can be stored into IRModule. Instead, it is a fragment of IR that we can stitch together to build an IRModule.
+
+`te/schedule` provides a collection of scheduling primitives to control the function being generated. In the future, we might bring some of
+these scheduling components into the `tir::PrimFunc` itself.
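+
+For instance, a vector-add computation and a default schedule can be written as the sketch below:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import te
+
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.placeholder((n,), name="B")
+    # Describe the computation as a tensor expression fragment.
+    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
+    # Create a schedule; scheduling primitives (split, tile, ...) can refine it.
+    s = te.create_schedule(C.op)
+    # Lowering stitches the fragment into an IRModule containing a tir.PrimFunc.
+    mod = tvm.lower(s, [A, B, C])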
+
+.. toctree::
+   :maxdepth: 1
+
+   inferbound
    hybrid_script
+
+topi
+----
+While we can construct different kinds of operators directly via tir or tensor expressions,
+it is tedious to do so. `topi` provides a set of pre-defined operators (in tensor expression or tir) defined by
+numpy and found in common deep learning workloads. We also provide a collection of common scheduling templates to schedule these operators on different target platforms.
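+
+A short sketch of using pre-defined topi operators (depending on the version, topi is importable either as a top-level package or as `tvm.topi`):
+
+.. code-block:: python
+
+    from tvm import te
+    import topi  # in more recent versions: from tvm import topi
+
+    n = te.var("n")
+    A = te.placeholder((n, n), name="A")
+    # Pre-defined operators save us from writing the tensor expressions by hand.
+    B = topi.nn.relu(A)
+    C = topi.sum(B, axis=1)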
+
+
+tvm/relay
+---------
+Relay is the high-level functional IR used to represent end-to-end models. Various optimizations are supported in relay/transform. There are multiple dialects in relay that support specific styles of high-level optimization. Notable ones include qnn (for importing pre-quantized models), vm (for lowering to the dynamic virtual machine), and memory (for memory optimization).
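+
+A minimal sketch of constructing a relay function and applying a transformation pass:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # Build a small relay function: y = relu(conv2d(x, w)).
+    x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
+    w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
+    y = relay.nn.relu(relay.nn.conv2d(x, w))
+    func = relay.Function([x, w], y)
+    mod = tvm.IRModule.from_expr(func)
+    # Apply a relay transformation pass to the module.
+    mod = relay.transform.FoldConstant()(mod)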
+
+.. toctree::
+   :maxdepth: 1
+
    relay_intro
-   relay_add_op
    relay_op_strategy
    relay_pass_infra
+   convert_layout
+
+
+tvm/autotvm
+-----------
+
+AutoTVM and Autoscheduler are components that automate the search-based transformations. This part of the codebase is evolving quickly and includes the following components:
+
+- Cost model and feature extraction.
+- A logging format to store benchmark results.
+- Search policy
+
+Automated program optimization is still an active research field. As a result, we try to modularize the design so that researchers can quickly

Review comment:
       ```suggestion
   Automated program optimization is still an active research field. As a result, we have attempted to modularize the design so that researchers may quickly modify a component or apply their own algorithms via the Python bindings. 
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example end-to-end compilation flow and discuss the key data structures and transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ for useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent (e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translates (codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed in the target runtime environment.
+- Runtime Execution: The user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and the APIs that transform them. Once we have identified the key data structures, we can then decompose the system into logical components that either define a collection of key data structures or define transformations among them.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end-to-end model. If you are familiar with the computational graph terminology in deep learning systems, you can view a relay.Function as a computational graph with additional support for control flow, recursion, and complex data structures.
+- **tir::PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass (FuseOps) to break the end-to-end function (e.g. mobilenet) into sub-function (e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-function. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways (in relay/backend) to handle the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic execution. Finally, we plan to support ahead-of-time compilation that compiles the high-level execution structure into an executable together with the generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the later part of the guide.
+
+**tir/transform** contains transformation passes for TIR-level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimization passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by LLVM, the CUDA C compiler, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimization for different hardware platforms. To do so, we need to investigate as many optimization choices as possible, including, but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of these choices. Instead, we take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, and vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can apply to a program. The system then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just look up the best schedule sequence and apply it to the program. Notably, this schedule application phase is **exactly like** the rule-based transformations, enabling us to share the same interface convention with traditional passes.
+
+We use search-based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM (auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
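+
+As a rough illustration (not a complete template), AutoTVM schedule templates declare tunable knobs on a config object, and the tuner searches over the space these knobs define:
+
+.. code-block:: python
+
+    from tvm import autotvm
+
+    # Inside a schedule template, tunable choices are declared on the config.
+    cfg = autotvm.get_config()
+    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])
+    cfg.define_knob("unroll", [True, False])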
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86 and ARM, we use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we support the direct translation of a relay function (sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowering before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
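+
+From the user's point of view, these phases are typically driven through `relay.build`, as in the sketch below; `mod` and `params` are assumed to come from a frontend importer, and the exact return value can differ across versions:
+
+.. code-block:: python
+
+    import tvm
+    from tvm import relay
+
+    # mod and params are assumed to come from a relay.frontend importer.
+    target = tvm.target.cuda()
+    with tvm.transform.PassContext(opt_level=3):
+        lib = relay.build(mod, target=target, params=params)
+    # The compiled artifact can be exported and later loaded as a runtime.Module.
+    lib.export_library("compiled_artifact.so")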
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimal set of APIs that allow a user to load and execute the compiled artifact in the language of their choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with types annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(arr)
+    print(arr.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for the generated functions. A runtime.PackedFunc can take arguments and return values with the following types: POD types (int, float), string, runtime.PackedFunc, runtime.Module, runtime.NDArray, and sub-classes of runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code that computes the launching parameters (e.g. the size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet below gives an example of an end-to-end model execution using the same interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with types annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main takeaway is that runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator-level programs (such as addone) and end-to-end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most parts of the compilation are transformations among these key data structures.
+
+- relay/transform and tir/transform are deterministic rule-based transformations
+- auto_scheduler and autotvm contain the search-based transformations
+
+Finally, the compilation flow example is only one typical use-case of the TVM stack. We expose these key data structures and transformations to the python and C++ APIs. As a result, you can use TVM just as you would use numpy, except that the data structure of interest changes from numpy.ndarray to tvm.IRModule. Here are some example use-cases:
+
+- Directly construct an IRModule using the hybrid script for compilation.
+- Compose a custom set of transformations (e.g. customized quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, such as a generic arena allocator, sockets, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. The runtime defines a stable, standard set of C APIs to interface with frontend languages such as python and rust.
+
+`runtime::Object` is one of the primary data structures in the TVM runtime besides `runtime::PackedFunc`. It is a reference-counted base class with a type index to support runtime type checking and downcasting. The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of TVM's runtime mechanism. All of the IR data structures are subclasses of `runtime::Object`, and as a result they can be directly accessed and manipulated from the python frontend. We use the PackedFunc mechanism to expose various APIs to the frontend.
+
+Runtime support for different hardware backends is defined in subdirectories of runtime (e.g. runtime/opencl). These hardware-specific runtime modules define APIs for device memory allocation and device function serialization.
+
+`runtime/rpc` implements RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote device and benchmark the execution performance. The rpc infrastructure enables data collection from a wide range of hardware backends for learning-based optimizations.
 
-Building a compiler stack for deep learning systems involves many many systems-level design decisions.
-In this part of documentation, we share the rationale for the specific choices made when designing TVM.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    runtime
    debugger
+   virtual_machine
+   introduction_to_module_serialization
+
+tvm/node
+--------
+The node module adds additional features on top of `runtime::Object` for IR data structures.
+The main features include reflection, serialization, structural equivalence, and hashing.
+
+Thanks to the node module, we can directly access any field of TVM's IR nodes by name in python.
+
+.. code-block:: python
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # we can directly use the field name to access the IR structures
+    assert y.a == x
+
+
+We can also directly serialize arbitrary IR nodes into a JSON format and load them back.
+The ability to save, load, and inspect an IR node provides a foundation for making the compiler more accessible.
+
+
+tvm/ir
+------
+The `tvm/ir` folder contains the unified data structures and interfaces for all IR function variants.
+The components in `tvm/ir` are shared by `tvm/relay` and `tvm/tir`; notable ones include
+
+- IRModule
+- Type
+- PassContext and Pass
+- Op
+
+Different variants of functions (e.g. relay.Function and tir.PrimFunc) can co-exist in an IRModule.
+While these variants may not have the same content representation, they use the same data structures to represent types,
+and as a consequence, their function signatures. The unified type system allows one function variant to call into another
+once we clearly define the calling convention, and it opens the door for future cross-function-variant optimizations.
+
+We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline.
+
+Op is the common class to represent all system-defined primitive operator/intrinsics.
+Developers can register new Ops as well as their additional attributes (e.g. whether the Op is elementwise) to the system.
+
+
+tvm/target
+----------
+The target module contains all the code generators that translate an IRModule to a target runtime.Module.
+It also provides a common `Target` class that describes the target.
+
+The compilation pipeline can be customized according to the target by querying the attribute information
+in the target and the builtin information registered to each target id (e.g. cuda, opencl).
+
+tvm/tir
+-------
+
+TIR contains the definitions of the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes.
+Besides the IR data structures, the tir module also defines a set of builtin intrinsics and their attributes via the common Op registry, as well as transformation passes in `tir/transform`.
+
+tvm/arith
+---------
+
+This module is closely tied to TIR. One of the key problems in low-level code generation is the analysis of the indices'
+arithmetic properties — the positiveness, variable bounds, and the integer sets that describe the iterator space. The arith module provides
+a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code.
+
+tvm/te
+------
+
+The name te stands for "tensor expression". This is a domain-specific language module that allows us to construct `tir::PrimFunc` variants quickly by writing tensor expressions.
+Importantly, a tensor expression itself is not a self-contained function that can be stored into IRModule. Instead, it is a fragment of IR that we can stitch together to build an IRModule.
+
+`te/schedule` provides a collection of scheduling primitives to control the function being generated. In the future, we might bring some of
+these scheduling components into the `tir::PrimFunc` itself.
+
+.. toctree::
+   :maxdepth: 1
+
+   inferbound
    hybrid_script
+
+topi
+----
+While we can construct different kinds of operators directly via tir or tensor expressions,
+it is tedious to do so. `topi` provides a set of pre-defined operators (in tensor expression or tir) defined by
+numpy and found in common deep learning workloads. We also provide a collection of common scheduling templates to schedule these operators on different target platforms.
+
+
+tvm/relay
+---------
+Relay is the high-level functional IR used to represent end-to-end models. Various optimizations are supported in relay/transform. There are multiple dialects in relay that support specific styles of high-level optimization. Notable ones include qnn (for importing pre-quantized models), vm (for lowering to the dynamic virtual machine), and memory (for memory optimization).
+
+.. toctree::
+   :maxdepth: 1
+
    relay_intro
-   relay_add_op
    relay_op_strategy
    relay_pass_infra
+   convert_layout
+
+
+tvm/autotvm
+-----------
+
+AutoTVM and Autoscheduler are components that automate the search-based transformations. This part of the codebase is evolving quickly and includes the following components:
+
+- Cost model and feature extraction.
+- A logging format to store benchmark results.
+- Search policy

Review comment:
       ```suggestion
   - A set of search policies over program transformations. 
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical components.
+  The sections after are specific guides about the logical components, organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we will review an example end-to-end compilation flow and discuss the key data structures and transformations.
+This runtime-based view shows the interactions of the components when running the compiler.
+Then we will review the logical modules of the codebase and their relations. This part provides a more static view of the overall design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture Components`_.
+Each architecture component section contains a short introduction to the corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ for useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high level, it contains several steps:
+
+- Importation: The frontend component ingests a model into an IRModule, which contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally equivalent or approximately equivalent (e.g. in the case of quantization) IRModule.
+- Target Translation: The compiler translates (codegen) the IRModule to an executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed in the target runtime environment.
+- Runtime Execution: The user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment.
+
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify the key data structures and the APIs that transform them. Once we have identified the key data structures, we can then decompose the system into logical components that either define a collection of key data structures or define transformations among them.
+
+**IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) contains a collection of functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A relay.Function usually corresponds to an end-to-end model. If you are familiar with the computational graph terminology in deep learning systems, you can view a relay.Function as a computational graph with additional support for control flow, recursion, and complex data structures.
+- **tir::PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized version.
+- lowering: transform a program to a lower-level representation that is closer to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. The optimizations include common program optimizations such as constant folding and dead-code elimination, and tensor-computation specific passes such as layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass (FuseOps) to break the end-to-end function (e.g. mobilenet) into sub-function (e.g. conv2d-relu) segments. We call these segments primitive functions. This process helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-function. For specific targets, we may also directly go to the target translation phase and use external code generators.
+
+There are a few different ways (in relay/backend) to handle the overall execution problem. For simple models with known shapes and no control flow, we can lower to a graph runtime that stores the execution structure in a graph. We also support a virtual machine backend for dynamic execution. Finally, we plan to support ahead-of-time compilation that compiles the high-level execution structure into an executable together with the generated primitive functions. All of these execution modes are encapsulated by a unified **runtime.Module** interface, which we will discuss in the later part of the guide.
+
+**tir/transform** contains transformation passes for TIR-level functions. Many tir passes serve the purpose of lowering. For example, there are passes to flatten multi-dimensional access to one-dimensional pointer access, to expand intrinsics into target-specific ones, and to decorate the function entry to meet the runtime calling convention. Of course, there are also optimization passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by LLVM, the CUDA C compiler, and other target compilers. As a result, we leave low-level optimizations such as register allocation to the downstream compilers and only focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and rule-based. One design goal of the TVM stack is to support high-performance code optimization for different hardware platforms. To do so, we need to investigate as many optimization choices as possible, including, but not limited to, multi-dimensional tensor access, loop tiling behavior, special accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of these choices. Instead, we take a search and learning-based approach. We first define a collection of actions we can take to transform a program. Example actions include loop transformations, inlining, and vectorization. We call these actions **scheduling primitives**. The collection of scheduling primitives defines a search space of possible optimizations we can apply to a program. The system then searches over different possible scheduling combinations to pick the best one. The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is completed. The compiler can then just look up the best schedule sequence and apply it to the program. Notably, this schedule application phase is **exactly like** the rule-based transformations, enabling us to share the same interface convention with traditional passes.
+
+We use search-based optimizations to handle the initial tir function generation problem. This part of the module is called AutoTVM (auto_scheduler). We expect to expand the learning-based transformations to more areas as we continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+
+The target translation phase transforms an IRModule to the corresponding target executable format. For backends such as x86 and ARM, we use the LLVM IRBuilder to build in-memory LLVM IR. We can also generate source-level languages such as CUDA C and OpenCL. Finally, we support the direct translation of a relay function (sub-graph) to external code generators. Importantly, we want to keep the target translation as lightweight as possible and perform most of the lowering before target translation.
+We also provide a Target structure to specify the compilation target. The transformations before the target translation phase can also be affected by the target — for example, a target's vector length would change the vectorization behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of tvm's runtime is to provide a minimal set of APIs that allow a user to load and execute the compiled artifact in the language of their choice, including python, c++, rust, go, java, and javascript. The code snippet below shows such an example in python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with types annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(arr)
+    print(arr.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for the generated functions. A runtime.PackedFunc can take arguments and return values with the following types: POD types (int, float), string, runtime.PackedFunc, runtime.Module, runtime.NDArray, and sub-classes of runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code that computes the launching parameters (e.g. the size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet below gives an example of an end-to-end model execution using the same interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with types annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main takeaway is that runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator-level programs (such as addone) and end-to-end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most parts of the compilation are transformations among these key data structures.
+
+- relay/transform and tir/transform are deterministic rule-based transformations
+- auto_scheduler and autotvm contain the search-based transformations
+
+Finally, the compilation flow example is only one typical use-case of the TVM stack. We expose these key data structures and transformations to the python and C++ APIs. As a result, you can use TVM just as you would use numpy, except that the data structure of interest changes from numpy.ndarray to tvm.IRModule. Here are some example use-cases:
+
+- Directly construct an IRModule using the hybrid script for compilation.
+- Compose a custom set of transformations (e.g. customized quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, such as a generic arena allocator, sockets, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. The runtime defines a stable, standard set of C APIs to interface with frontend languages such as python and rust.
+
+`runtime::Object` is one of the primary data structures in the TVM runtime besides `runtime::PackedFunc`. It is a reference-counted base class with a type index to support runtime type checking and downcasting. The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of TVM's runtime mechanism. All of the IR data structures are subclasses of `runtime::Object`, and as a result they can be directly accessed and manipulated from the python frontend. We use the PackedFunc mechanism to expose various APIs to the frontend.
+
+Runtime support for different hardware backends is defined in subdirectories of runtime (e.g. runtime/opencl). These hardware-specific runtime modules define APIs for device memory allocation and device function serialization.
+
+`runtime/rpc` implements RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote device and benchmark the execution performance. The rpc infrastructure enables data collection from a wide range of hardware backends for learning-based optimizations.
 
-Building a compiler stack for deep learning systems involves many many systems-level design decisions.
-In this part of documentation, we share the rationale for the specific choices made when designing TVM.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    runtime
    debugger
+   virtual_machine
+   introduction_to_module_serialization
+
+tvm/node
+--------
+The node module adds additional features on top of `runtime::Object` for IR data structures.
+The main features include reflection, serialization, structural equivalence, and hashing.
+
+Thanks to the node module, we can directly access any field of TVM's IR nodes by name in python.
+
+.. code-block:: python
+
+    x = tvm.tir.Var("x", "int32")
+    y = tvm.tir.Add(x, x)
+    # we can directly use the field name to access the IR structures
+    assert y.a == x
+
+
+We can also directly serialize arbitrary IR nodes into a JSON format and load them back.
+The ability to save, load, and inspect an IR node provides a foundation for making the compiler more accessible.
+
+
+tvm/ir
+------
+The `tvm/ir` folder contains the unified data structures and interfaces for all IR function variants.
+The components in `tvm/ir` are shared by `tvm/relay` and `tvm/tir`; notable ones include
+
+- IRModule
+- Type
+- PassContext and Pass
+- Op
+
+Different variants of functions (e.g. relay.Function and tir.PrimFunc) can co-exist in an IRModule.
+While these variants may not have the same content representation, they use the same data structures to represent types,
+and as a consequence, their function signatures. The unified type system allows one function variant to call into another
+once we clearly define the calling convention, and it opens the door for future cross-function-variant optimizations.
+
+We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline.
+
+Op is the common class to represent all system-defined primitive operator/intrinsics.
+Developers can register new Ops as well as their additional attributes (e.g. whether the Op is elementwise) to the system.
+
+
+tvm/target
+----------
+The target module contains all the code generators that translate an IRModule to a target runtime.Module.
+It also provides a common `Target` class that describes the target.
+
+The compilation pipeline can be customized according to the target by querying the attribute information
+in the target and the builtin information registered to each target id (e.g. cuda, opencl).
+
+tvm/tir
+-------
+
+TIR contains the definitions of the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes.
+Besides the IR data structures, the tir module also defines a set of builtin intrinsics and their attributes via the common Op registry, as well as transformation passes in `tir/transform`.
+
+tvm/arith
+---------
+
+This module is closely tied to TIR. One of the key problems in low-level code generation is the analysis of the indices'
+arithmetic properties — the positiveness, variable bounds, and the integer sets that describe the iterator space. The arith module provides
+a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code.
+
+tvm/te
+------
+
+The name te stands for "tensor expression". This is a domain-specific language module that allows us to construct `tir::PrimFunc` variants quickly by writing tensor expressions.
+Importantly, a tensor expression itself is not a self-contained function that can be stored into IRModule. Instead, it is a fragment of IR that we can stitch together to build an IRModule.
+
+`te/schedule` provides a collection of scheduling primitives to control the function being generated. In the future, we might bring some of
+these scheduling components into the `tir::PrimFunc` itself.
+
+.. toctree::
+   :maxdepth: 1
+
+   inferbound
    hybrid_script
+
+topi
+----
+While we can build different kinds of operators directly via tir or tensor expression.

Review comment:
       ```suggestion
   While possible to construct operators directly via TIR or tensor expressions (TE) for each use case it is tedious to do so. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org