Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/03/17 23:12:42 UTC

[GitHub] [tvm] comaniac commented on a change in pull request #7642: [docs] Getting Started With TVM: Tensor Expressions

comaniac commented on a change in pull request #7642:
URL: https://github.com/apache/tvm/pull/7642#discussion_r596429340



##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -17,16 +17,39 @@
 """
 .. _tutorial-tensor-expr-get-started:
 
-Get Started with Tensor Expression
-==================================
+Working with Operators Using Tensor Expressions
+===============================================
 **Author**: `Tianqi Chen <https://tqchen.github.io>`_
 
-This is an introductory tutorial to the Tensor expression language in TVM.
-TVM uses a domain specific tensor expression for efficient kernel construction.
+In this tutorial we will turn our attention to how TVM works with Template

Review comment:
       ```suggestion
   In this tutorial we will turn our attention to how TVM works with Tensor
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell``
+tgt = "llvm"
 
 ######################################################################
-# Describe the Computation
-# ------------------------
-# As a first step, we need to describe our computation.
-# TVM adopts tensor semantics, with each intermediate result
-# represented as a multi-dimensional array. The user needs to describe
-# the computation rule that generates the tensors.
-#
-# We first define a symbolic variable n to represent the shape.
-# We then define two placeholder Tensors, A and B, with given shape (n,)
-#
-# We then describe the result tensor C, with a compute operation.  The
-# compute function takes the shape of the tensor, as well as a lambda
-# function that describes the computation rule for each position of
-# the tensor.
-#
-# No computation happens during this phase, as we are only declaring how
-# the computation should be done.
-#
+# Describing the Vector Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# We describe a vector addition computation. TVM adopts tensor semantics, with
+# each intermediate result represented as a multi-dimensional array. The user
+# needs to describe the computation rule that generates the tensors. We first
+# define a symbolic variable n to represent the shape. We then define two
+# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
+# describe the result tensor ``C``, with a ``compute`` operation. The
+# ``compute`` defines a computation, with the output conforming to the
+# specified tensor shape and the computation to be performed at each position
+# in the tensor defined by the lambda function. Note that while ``n`` is a
+# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
+# tensors. Remember, no actual computation happens during this phase, as we
+# are only declaring how the computation should be done.
+
 n = te.var("n")
 A = te.placeholder((n,), name="A")
 B = te.placeholder((n,), name="B")
 C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
 print(type(C))
 
 ######################################################################
-# Schedule the Computation
-# ------------------------
-# While the above lines describe the computation rule, we can compute
-# C in many ways since the axis of C can be computed in a data
-# parallel manner.  TVM asks the user to provide a description of the
-# computation called a schedule.
-#
-# A schedule is a set of transformation of computation that transforms
-# the loop of computations in the program.
-#
-# After we construct the schedule, by default the schedule computes
-# C in a serial manner in a row-major order.
+# Create a Default Schedule for the Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# While the above lines describe the computation rule, we can compute C in many
+# different ways. For a tensor with multiple axes, you can choose which axis to
+# iterate over first, or computations can be split across different threads.
+# TVM requires that the user to provide a schedule, which is a description of
+# how the computation should be performed. Scheduling operations within TE
+# can change loop orders, split computations across different threads, group
+# blocks of data together, amongst other operations. An important concept behind
+# schedules is that different schedules for the same operation will produce the
+# same result.

Review comment:
       ```suggestion
   # schedules is that they only describe how the computation is performed, so different schedules for the same TE will produce the
   # same result.
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Paralleism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can use vectorization to use SIMD
+# instructions using the vectorize scheduling primitive, then the outer loops
+# can be parallelized using the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs

Review comment:
       ```suggestion
   # will target compilation of the vector addition to GPUs.
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -17,16 +17,39 @@
 """
 .. _tutorial-tensor-expr-get-started:
 
-Get Started with Tensor Expression
-==================================
+Working with Operators Using Tensor Expressions
+===============================================
 **Author**: `Tianqi Chen <https://tqchen.github.io>`_
 
-This is an introductory tutorial to the Tensor expression language in TVM.
-TVM uses a domain specific tensor expression for efficient kernel construction.
+In this tutorial we will turn our attention to how TVM works with Template
+Expressions (TE) to create a space to search for performant configurations. TE
+describes tensor computations in a pure functional language (that is each
+expression has no side effects). When viewed in context of the TVM as a whole,
+Relay describes a computation as a set of operators, and each of these
+operators can be represented as a TE expression where each TE expression takes
+an input tensor and produces an output tensor. It's important to note that the
+tensor isn't necessarily a fully materialized array, rather it is a
+representation of a computation. If you want to produce a computation from a
+TE, you will need to use the scheduling features of TVM.

Review comment:
       This is a bit confusing. TE without a schedule is basically just TE with a naive schedule. Using or not using the scheduling features doesn't change the semantics of the TE. I tried to rephrase this part; please check whether it makes sense to you.
   
   ```suggestion
   an input tensor and produces an output tensor. It's important to note that TE
   does not perform actual computation, rather it represents how the output tensor
   should be computed. Note that if you want to optimize the performance of a certain
   TE, you will need to use the scheduling features of TVM.
   ```
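    To illustrate the point, a minimal sketch using only the APIs already shown in this tutorial: the TE itself only describes the computation, `te.create_schedule` attaches a naive serial schedule, and later scheduling calls change how it runs, not what it computes.

    ```python
    import tvm
    from tvm import te

    # The TE describes what to compute; no computation happens here.
    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

    # create_schedule attaches the naive schedule: one serial loop over i.
    s = te.create_schedule(C.op)
    print(tvm.lower(s, [A, B, C], simple_mode=True))

    # A scheduling call such as parallel() changes how C is computed,
    # not its semantics -- the result is still A[i] + B[i].
    s[C].parallel(C.op.axis[0])
    print(tvm.lower(s, [A, B, C], simple_mode=True))
    ```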

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell``
+tgt = "llvm"
 
 ######################################################################
-# Describe the Computation
-# ------------------------
-# As a first step, we need to describe our computation.
-# TVM adopts tensor semantics, with each intermediate result
-# represented as a multi-dimensional array. The user needs to describe
-# the computation rule that generates the tensors.
-#
-# We first define a symbolic variable n to represent the shape.
-# We then define two placeholder Tensors, A and B, with given shape (n,)
-#
-# We then describe the result tensor C, with a compute operation.  The
-# compute function takes the shape of the tensor, as well as a lambda
-# function that describes the computation rule for each position of
-# the tensor.
-#
-# No computation happens during this phase, as we are only declaring how
-# the computation should be done.
-#
+# Describing the Vector Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# We describe a vector addition computation. TVM adopts tensor semantics, with
+# each intermediate result represented as a multi-dimensional array. The user
+# needs to describe the computation rule that generates the tensors. We first
+# define a symbolic variable n to represent the shape. We then define two
+# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
+# describe the result tensor ``C``, with a ``compute`` operation. The
+# ``compute`` defines a computation, with the output conforming to the
+# specified tensor shape and the computation to be performed at each position
+# in the tensor defined by the lambda function. Note that while ``n`` is a
+# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
+# tensors. Remember, no actual computation happens during this phase, as we
+# are only declaring how the computation should be done.
+
 n = te.var("n")
 A = te.placeholder((n,), name="A")
 B = te.placeholder((n,), name="B")
 C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
 print(type(C))
 
 ######################################################################
-# Schedule the Computation
-# ------------------------
-# While the above lines describe the computation rule, we can compute
-# C in many ways since the axis of C can be computed in a data
-# parallel manner.  TVM asks the user to provide a description of the
-# computation called a schedule.
-#
-# A schedule is a set of transformation of computation that transforms
-# the loop of computations in the program.
-#
-# After we construct the schedule, by default the schedule computes
-# C in a serial manner in a row-major order.
+# Create a Default Schedule for the Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# While the above lines describe the computation rule, we can compute C in many
+# different ways. For a tensor with multiple axes, you can choose which axis to
+# iterate over first, or computations can be split across different threads.
+# TVM requires that the user to provide a schedule, which is a description of
+# how the computation should be performed. Scheduling operations within TE
+# can change loop orders, split computations across different threads, group
+# blocks of data together, amongst other operations. An important concept behind
+# schedules is that different schedules for the same operation will produce the
+# same result.
+#
+# TVM allows you to create a default schedule that will compute ``C`` in by

Review comment:
       We sometimes use "default schedule" to refer to the fallback TOPI schedule, so it would be better to differentiate the two as early as possible.
   
   ```suggestion
   # TVM allows you to create a naive schedule that will compute ``C`` in by
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Paralleism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can use vectorization to use SIMD
+# instructions using the vectorize scheduling primitive, then the outer loops
+# can be parallelized using the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs
+
+# If you want to run this code, change ``run_cuda = True``
+run_cuda = False
+if run_cuda:
+
+    # Change this target to the correct backend for you gpu. For example: cuda (NVIDIA GPUs),
+    # rocm (Radeon GPUS), OpenCL (opencl).
+    tgt_gpu = "cuda"
+
+    # Recreate the schedule
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.placeholder((n,), name="B")
+    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+    print(type(C))
+
+    s = te.create_schedule(C.op)
+
+    bx, tx = s[C].split(C.op.axis[0], factor=64)
+
+    ################################################################################
+    # Finally we bind the iteration axis bx and tx to threads in the GPU compute
+    # grid. These are GPU specific constructs that allow us to generate code that
+    # runs on GPU.

Review comment:
       Since a naive schedule is invalid for GPUs, we had better emphasize that thread binding is required in this case.
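    For reference, the binding this refers to is the same pair of calls that appears elsewhere in the tutorial (this assumes the `s`, `bx`, and `tx` from the snippet above); without it the lowered kernel has no block/thread mapping and the build fails for GPU targets:

    ```python
    # Map the outer/inner loop axes produced by split() onto the GPU block and
    # thread indices; cuda/rocm/opencl targets require this explicit binding.
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))
    ```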

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell``
+tgt = "llvm"
 
 ######################################################################
-# Describe the Computation
-# ------------------------
-# As a first step, we need to describe our computation.
-# TVM adopts tensor semantics, with each intermediate result
-# represented as a multi-dimensional array. The user needs to describe
-# the computation rule that generates the tensors.
-#
-# We first define a symbolic variable n to represent the shape.
-# We then define two placeholder Tensors, A and B, with given shape (n,)
-#
-# We then describe the result tensor C, with a compute operation.  The
-# compute function takes the shape of the tensor, as well as a lambda
-# function that describes the computation rule for each position of
-# the tensor.
-#
-# No computation happens during this phase, as we are only declaring how
-# the computation should be done.
-#
+# Describing the Vector Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# We describe a vector addition computation. TVM adopts tensor semantics, with
+# each intermediate result represented as a multi-dimensional array. The user
+# needs to describe the computation rule that generates the tensors. We first
+# define a symbolic variable n to represent the shape. We then define two
+# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
+# describe the result tensor ``C``, with a ``compute`` operation. The
+# ``compute`` defines a computation, with the output conforming to the
+# specified tensor shape and the computation to be performed at each position
+# in the tensor defined by the lambda function. Note that while ``n`` is a
+# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
+# tensors. Remember, no actual computation happens during this phase, as we
+# are only declaring how the computation should be done.
+
 n = te.var("n")
 A = te.placeholder((n,), name="A")
 B = te.placeholder((n,), name="B")
 C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
 print(type(C))
 
 ######################################################################
-# Schedule the Computation
-# ------------------------
-# While the above lines describe the computation rule, we can compute
-# C in many ways since the axis of C can be computed in a data
-# parallel manner.  TVM asks the user to provide a description of the
-# computation called a schedule.
-#
-# A schedule is a set of transformation of computation that transforms
-# the loop of computations in the program.
-#
-# After we construct the schedule, by default the schedule computes
-# C in a serial manner in a row-major order.
+# Create a Default Schedule for the Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# While the above lines describe the computation rule, we can compute C in many
+# different ways. For a tensor with multiple axes, you can choose which axis to

Review comment:
       ```suggestion
   # different ways to fit different devices. For a tensor with multiple axes, you can choose which axis to
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Paralleism

Review comment:
       It would be better for these sections to print performance numbers, so that we can see how each schedule affects performance.
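    For example, a self-contained sketch of how a number could be reported with the runtime's `time_evaluator` (the array size and repeat count below are arbitrary choices, not from the PR):

    ```python
    import numpy as np
    import tvm
    from tvm import te

    # Build the naive-schedule version of the vector add.
    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
    s = te.create_schedule(C.op)
    fadd = tvm.build(s, [A, B, C], "llvm", name="myadd")

    # Time it with the TVM runtime's evaluator.
    ctx = tvm.cpu(0)
    size = 1 << 20
    a = tvm.nd.array(np.random.uniform(size=size).astype(A.dtype), ctx)
    b = tvm.nd.array(np.random.uniform(size=size).astype(B.dtype), ctx)
    c = tvm.nd.array(np.zeros(size, dtype=C.dtype), ctx)
    evaluator = fadd.time_evaluator(fadd.entry_name, ctx, number=10)
    print("naive schedule: %f s" % evaluator(a, b, c).mean)
    ```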

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell``
+tgt = "llvm"
 
 ######################################################################
-# Describe the Computation
-# ------------------------
-# As a first step, we need to describe our computation.
-# TVM adopts tensor semantics, with each intermediate result
-# represented as a multi-dimensional array. The user needs to describe
-# the computation rule that generates the tensors.
-#
-# We first define a symbolic variable n to represent the shape.
-# We then define two placeholder Tensors, A and B, with given shape (n,)
-#
-# We then describe the result tensor C, with a compute operation.  The
-# compute function takes the shape of the tensor, as well as a lambda
-# function that describes the computation rule for each position of
-# the tensor.
-#
-# No computation happens during this phase, as we are only declaring how
-# the computation should be done.
-#
+# Describing the Vector Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# We describe a vector addition computation. TVM adopts tensor semantics, with
+# each intermediate result represented as a multi-dimensional array. The user
+# needs to describe the computation rule that generates the tensors. We first
+# define a symbolic variable n to represent the shape. We then define two
+# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
+# describe the result tensor ``C``, with a ``compute`` operation. The
+# ``compute`` defines a computation, with the output conforming to the
+# specified tensor shape and the computation to be performed at each position
+# in the tensor defined by the lambda function. Note that while ``n`` is a
+# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
+# tensors. Remember, no actual computation happens during this phase, as we
+# are only declaring how the computation should be done.
+
 n = te.var("n")
 A = te.placeholder((n,), name="A")
 B = te.placeholder((n,), name="B")
 C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
 print(type(C))
 
 ######################################################################
-# Schedule the Computation
-# ------------------------
-# While the above lines describe the computation rule, we can compute
-# C in many ways since the axis of C can be computed in a data
-# parallel manner.  TVM asks the user to provide a description of the
-# computation called a schedule.
-#
-# A schedule is a set of transformation of computation that transforms
-# the loop of computations in the program.
-#
-# After we construct the schedule, by default the schedule computes
-# C in a serial manner in a row-major order.
+# Create a Default Schedule for the Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# While the above lines describe the computation rule, we can compute C in many
+# different ways. For a tensor with multiple axes, you can choose which axis to
+# iterate over first, or computations can be split across different threads.
+# TVM requires that the user to provide a schedule, which is a description of
+# how the computation should be performed. Scheduling operations within TE
+# can change loop orders, split computations across different threads, group
+# blocks of data together, amongst other operations. An important concept behind
+# schedules is that different schedules for the same operation will produce the
+# same result.
+#
+# TVM allows you to create a default schedule that will compute ``C`` in by
+# iterating in row major order.
 #
 # .. code-block:: c
 #
 #   for (int i = 0; i < n; ++i) {
 #     C[i] = A[i] + B[i];
 #   }
-#
-s = te.create_schedule(C.op)
 
-######################################################################
-# We used the split construct to split the first axis of C,
-# this will split the original iteration axis into product of
-# two iterations. This is equivalent to the following code.
-#
-# .. code-block:: c
-#
-#   for (int bx = 0; bx < ceil(n / 64); ++bx) {
-#     for (int tx = 0; tx < 64; ++tx) {
-#       int i = bx * 64 + tx;
-#       if (i < n) {
-#         C[i] = A[i] + B[i];
-#       }
-#     }
-#   }
-#
-bx, tx = s[C].split(C.op.axis[0], factor=64)
+s = te.create_schedule(C.op)
 
 ######################################################################
-# Finally we bind the iteration axis bx and tx to threads in the GPU
-# compute grid. These are GPU specific constructs that allow us
-# to generate code that runs on GPU.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    s[C].bind(bx, te.thread_axis("blockIdx.x"))
-    s[C].bind(tx, te.thread_axis("threadIdx.x"))
+# Compile and Evaluate the Default Schedule
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# With the TE expression and a schedule, we can produce runnable code for our
+# target language and architecture, in this case LLVM and a CPU. We provide
+# TVM with the schedule, a list of the TE expressions that are in the schedule,
+# the target and host, and the name of the function we are producing. The result
+# of the output is a type-erased function that can be called directly from Python.
+#
+# In the following line, we use tvm.build to create a function. The build
+# function takes the schedule, the desired signature of the function (including
+# the inputs and outputs) as well as target language we want to compile to.
 
-######################################################################
-# Compilation
-# -----------
-# After we have finished specifying the schedule, we can compile it
-# into a TVM function. By default TVM compiles into a type-erased
-# function that can be directly called from the python side.
-#
-# In the following line, we use tvm.build to create a function.
-# The build function takes the schedule, the desired signature of the
-# function (including the inputs and outputs) as well as target language
-# we want to compile to.
-#
-# The result of compilation fadd is a GPU device function (if GPU is
-# involved) as well as a host wrapper that calls into the GPU
-# function.  fadd is the generated host wrapper function, it contains
-# a reference to the generated device function internally.
-#
 fadd = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd")
 
-######################################################################
-# Run the Function
-# ----------------
-# The compiled TVM function is exposes a concise C API
-# that can be invoked from any language.
-#
-# We provide a minimal array API in python to aid quick testing and prototyping.
-# The array API is based on the `DLPack <https://github.com/dmlc/dlpack>`_ standard.
-#
-# - We first create a GPU context.
-# - Then tvm.nd.array copies the data to the GPU.
-# - fadd runs the actual computation.
-# - asnumpy() copies the GPU array back to the CPU and we can use this to verify correctness
-#
+################################################################################
+# Let's run the function, and compare the output to the same computation in
+# numpy. We begin by creating a context, which is a device (CPU, GPU) that TVM can

Review comment:
       Better not to mention GPUs here to reduce confusion, especially since the schedule we just created cannot be used on a GPU.
   
   ```suggestion
   # numpy. We begin by creating a context, which is a device (CPU in this example) that TVM can
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -17,16 +17,39 @@
 """
 .. _tutorial-tensor-expr-get-started:
 
-Get Started with Tensor Expression
-==================================
+Working with Operators Using Tensor Expressions
+===============================================
 **Author**: `Tianqi Chen <https://tqchen.github.io>`_
 
-This is an introductory tutorial to the Tensor expression language in TVM.
-TVM uses a domain specific tensor expression for efficient kernel construction.
+In this tutorial we will turn our attention to how TVM works with Template
+Expressions (TE) to create a space to search for performant configurations. TE
+describes tensor computations in a pure functional language (that is each
+expression has no side effects). When viewed in context of the TVM as a whole,
+Relay describes a computation as a set of operators, and each of these
+operators can be represented as a TE expression where each TE expression takes
+an input tensor and produces an output tensor. It's important to note that the
+tensor isn't necessarily a fully materialized array, rather it is a
+representation of a computation. If you want to produce a computation from a
+TE, you will need to use the scheduling features of TVM.
 
-In this tutorial, we will demonstrate the basic workflow to use
-the tensor expression language.
+This is an introductory tutorial to the Tensor expression language in TVM. TVM
+uses a domain specific tensor expression for efficient kernel construction. We
+will demonstrate the basic workflow with two examples of using the tensor expression
+language. The first example introduces TE and scheduling with vector
+addition. The second expands on these concepts with a step-by-step optimization
+of a matrix multiplication with TE. This matrix multiplication example will
+serve as the comparative basis for future tutorials covering more advanced
+features of TVM.
 """
+
+################################################################################
+# Example 1: Writing and Scheduling Vector Addition in TE for CPU
+# ---------------------------------------------------------------
+#
+# Let's look at an example in Python in which we will implement a schedule for
+# vector addition, targeted towards a CPU. We begin by initializing a TVM
+# environment.

Review comment:
       ```suggestion
   # Let's look at an example in Python in which we will implement a TE for
   # vector addition, followed by a schedule targeted towards a CPU. We begin by initializing a TVM
   # environment.
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell``
+tgt = "llvm"
 
 ######################################################################
-# Describe the Computation
-# ------------------------
-# As a first step, we need to describe our computation.
-# TVM adopts tensor semantics, with each intermediate result
-# represented as a multi-dimensional array. The user needs to describe
-# the computation rule that generates the tensors.
-#
-# We first define a symbolic variable n to represent the shape.
-# We then define two placeholder Tensors, A and B, with given shape (n,)
-#
-# We then describe the result tensor C, with a compute operation.  The
-# compute function takes the shape of the tensor, as well as a lambda
-# function that describes the computation rule for each position of
-# the tensor.
-#
-# No computation happens during this phase, as we are only declaring how
-# the computation should be done.
-#
+# Describing the Vector Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# We describe a vector addition computation. TVM adopts tensor semantics, with
+# each intermediate result represented as a multi-dimensional array. The user
+# needs to describe the computation rule that generates the tensors. We first
+# define a symbolic variable n to represent the shape. We then define two
+# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
+# describe the result tensor ``C``, with a ``compute`` operation. The
+# ``compute`` defines a computation, with the output conforming to the
+# specified tensor shape and the computation to be performed at each position
+# in the tensor defined by the lambda function. Note that while ``n`` is a
+# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
+# tensors. Remember, no actual computation happens during this phase, as we
+# are only declaring how the computation should be done.
+
 n = te.var("n")
 A = te.placeholder((n,), name="A")
 B = te.placeholder((n,), name="B")
 C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
 print(type(C))
 
 ######################################################################
-# Schedule the Computation
-# ------------------------
-# While the above lines describe the computation rule, we can compute
-# C in many ways since the axis of C can be computed in a data
-# parallel manner.  TVM asks the user to provide a description of the
-# computation called a schedule.
-#
-# A schedule is a set of transformation of computation that transforms
-# the loop of computations in the program.
-#
-# After we construct the schedule, by default the schedule computes
-# C in a serial manner in a row-major order.
+# Create a Default Schedule for the Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# While the above lines describe the computation rule, we can compute C in many

Review comment:
       ```suggestion
   # While the above lines describe the computation rule, we can compute ``C`` in many
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Paralleism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can use vectorization to use SIMD
+# instructions using the vectorize scheduling primitive, then the outer loops
+# can be parallelized using the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs
+
+# If you want to run this code, change ``run_cuda = True``
+run_cuda = False
+if run_cuda:
+
+    # Change this target to the correct backend for you gpu. For example: cuda (NVIDIA GPUs),
+    # rocm (Radeon GPUS), OpenCL (opencl).
+    tgt_gpu = "cuda"
+
+    # Recreate the schedule
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.placeholder((n,), name="B")
+    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+    print(type(C))
+
+    s = te.create_schedule(C.op)
+
+    bx, tx = s[C].split(C.op.axis[0], factor=64)
+
+    ################################################################################
+    # Finally we bind the iteration axis bx and tx to threads in the GPU compute
+    # grid. These are GPU specific constructs that allow us to generate code that
+    # runs on GPU.
+
+    if tgt_gpu == "cuda" or tgt_gpu == "rocm" or tgt_gpu.startswith("opencl"):

Review comment:
       There seems to be no need for this check.

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell``

Review comment:
       It'd be better to provide more information about how to determine the proper target. cc @junrushao1994 
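    For instance, a short table of illustrative target strings could help (the CPU names below are only examples; the valid `-mcpu` values depend on the LLVM build, e.g. as listed by `llc -mcpu=help`):

    ```python
    # Illustrative mapping from host CPU to a more specific LLVM target string;
    # these names are examples, not an exhaustive or authoritative list.
    cpu_targets = {
        "Intel Broadwell": "llvm -mcpu=broadwell",
        "Intel Skylake-X": "llvm -mcpu=skylake-avx512",
        "AMD Zen 2": "llvm -mcpu=znver2",
        "ARM Cortex-A72 (Raspberry Pi 4)": "llvm -mcpu=cortex-a72",
    }
    for cpu, target_str in cpu_targets.items():
        print("%-35s %s" % (cpu, target_str))
    ```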

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Paralleism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can use vectorization to use SIMD
+# instructions using the vectorize scheduling primitive, then the outer loops
+# can be parallelized using the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4

Review comment:
       Better to explain where this 4 comes from.
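    For instance, a hedged sketch of where the value could come from (the tutorial's own wording says "the number of threads on your CPU"; another common choice is the SIMD width of the target):

    ```python
    import multiprocessing

    # Follows the tutorial's wording: split by the number of CPU threads.
    factor = multiprocessing.cpu_count()
    # Alternatively, match the vector width, e.g. 4 float32 lanes for 128-bit SIMD:
    # factor = 4
    print("split factor:", factor)
    ```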

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -255,41 +340,39 @@
 fadd1(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
+################################################################################
 # Pack Everything into One Library
-# --------------------------------
-# In the above example, we store the device and host code separately.
-# TVM also supports export everything as one shared library.
-# Under the hood, we pack the device modules into binary blobs and link
-# them together with the host code.
-# Currently we support packing of Metal, OpenCL and CUDA modules.
-#
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# In the above example, we store the device and host code separately. TVM also
+# supports export everything as one shared library. Under the hood, we pack
+# the device modules into binary blobs and link them together with the host
+# code. Currently we support packing of Metal, OpenCL and CUDA modules.
+
 fadd.export_library(temp.relpath("myadd_pack.so"))
 fadd2 = tvm.runtime.load_module(temp.relpath("myadd_pack.so"))
 fadd2(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
+################################################################################
 # .. note:: Runtime API and Thread-Safety
 #
-#   The compiled modules of TVM do not depend on the TVM compiler.
-#   Instead, they only depend on a minimum runtime library.
-#   The TVM runtime library wraps the device drivers and provides
-#   thread-safe and device agnostic calls into the compiled functions.
-#
-#   This means that you can call the compiled TVM functions from any thread,
-#   on any GPUs.
+#   The compiled modules of TVM do not depend on the TVM compiler. Instead,
+#   they only depend on a minimum runtime library. The TVM runtime library
+#   wraps the device drivers and provides thread-safe and device agnostic calls
+#   into the compiled functions.
 #
+#   This means that you can call the compiled TVM functions from any thread, on
+#   any GPUs, provided that you have compiled the code for that GPU.
 
-######################################################################
+################################################################################
 # Generate OpenCL Code

Review comment:
       I actually don't get the point of mentioning OpenCL here, and it confuses me a lot. Specifically, IIUC, `s` should refer to the GPU schedule rather than the CPU one, since OpenCL also needs thread binding. If that's the case, this block should be inside `run_cuda` as well, but it is also odd to put a CPU part inside the `run_cuda` block. In short, I think we could remove this part if it serves no specific purpose.

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Parallelism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can then take advantage of SIMD
+# instructions through the vectorize scheduling primitive, and the outer loops
+# can be parallelized using the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs.
+
+# If you want to run this code, set ``run_cuda = True``.

Review comment:
       Mention something like we don't run the following example by default due to CI.
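Something along these lines could work (illustrative sketch only; `build_and_run_gpu_example` is a hypothetical placeholder for the GPU code that follows in the tutorial):

```python
# The GPU example is disabled by default so the tutorial can run in CI
# environments that have no GPU attached; flip the flag to try it locally.
run_cuda = False  # set to True on a machine with a CUDA-capable GPU and driver

if run_cuda:
    build_and_run_gpu_example()  # hypothetical stand-in for the GPU code below
```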

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -302,18 +385,452 @@
     fadd_cl(a, b, c)
     tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Summary
-# -------
-# This tutorial provides a walk through of TVM workflow using
-# a vector add example. The general workflow is
+################################################################################
+# .. note:: Code Specialization
+#
+#   As you may have noticed, the declarations of A, B and C all take the same
+#   shape argument, n. TVM will take advantage of this to pass only a single
+#   shape argument to the kernel, as you will find in the printed device code.
+#   This is one form of specialization.
+#
+#   On the host side, TVM will automatically generate check code that checks
+#   the constraints in the parameters. So if you pass arrays with different
+#   shapes into fadd, an error will be raised.
+#
+#   We can do more specializations. For example, we can write :code:`n =
+#   tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`, in the
+#   computation declaration. The generated function will only take vectors with
+#   length 1024.

Review comment:
       I feel this information is way more important than the OpenCL example.
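To make the note concrete, a short snippet could be added right after it — a sketch assuming only the API the note itself mentions:

```python
import numpy as np
import tvm
from tvm import te

# Specialize the kernel to a fixed length instead of a symbolic n: the
# generated function will only accept vectors of exactly 1024 elements.
n = tvm.runtime.convert(1024)
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
s = te.create_schedule(C.op)
fadd_1024 = tvm.build(s, [A, B, C], "llvm", name="myadd_1024")

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.uniform(size=1024).astype("float32"), dev)
b = tvm.nd.array(np.random.uniform(size=1024).astype("float32"), dev)
c = tvm.nd.array(np.zeros(1024, dtype="float32"), dev)
fadd_1024(a, b, c)  # passing arrays of any other length raises an error
```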




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org