Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/03/18 16:08:42 UTC

[GitHub] [tvm] Hzfengsy commented on a change in pull request #7642: [docs] Getting Started With TVM: Tensor Expressions

Hzfengsy commented on a change in pull request #7642:
URL: https://github.com/apache/tvm/pull/7642#discussion_r596989768



##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -17,16 +17,39 @@
 """
 .. _tutorial-tensor-expr-get-started:
 
-Get Started with Tensor Expression
-==================================
+Working with Operators Using Tensor Expressions
+===============================================
 **Author**: `Tianqi Chen <https://tqchen.github.io>`_
 
-This is an introductory tutorial to the Tensor expression language in TVM.
-TVM uses a domain specific tensor expression for efficient kernel construction.
+In this tutorial we will turn our attention to how TVM works with Tensor
+Expressions (TE) to create a space to search for performant configurations. TE
+describes tensor computations in a pure functional language (that is, each
+expression has no side effects). When viewed in the context of TVM as a whole,
+Relay describes a computation as a set of operators, and each of these
+operators can be represented as a TE expression where each TE expression takes
+an input tensor and produces an output tensor. It's important to note that the

Review comment:
       ```suggestion
   input tensors and produces an output tensor. It's important to note that the
   ```

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -17,16 +17,39 @@
 """
 .. _tutorial-tensor-expr-get-started:
 
-Get Started with Tensor Expression
-==================================
+Working with Operators Using Tensor Expressions
+===============================================
 **Author**: `Tianqi Chen <https://tqchen.github.io>`_
 
-This is an introductory tutorial to the Tensor expression language in TVM.
-TVM uses a domain specific tensor expression for efficient kernel construction.
+In this tutorial we will turn our attention to how TVM works with Tensor
+Expressions (TE) to create a space to search for performant configurations. TE
+describes tensor computations in a pure functional language (that is, each
+expression has no side effects). When viewed in the context of TVM as a whole,
+Relay describes a computation as a set of operators, and each of these
+operators can be represented as a TE expression where each TE expression takes
+an input tensor and produces an output tensor. It's important to note that the
+tensor isn't necessarily a fully materialized array; rather, it is a
+representation of a computation. If you want to produce a computation from a
+TE, you will need to use the scheduling features of TVM.
 
-In this tutorial, we will demonstrate the basic workflow to use
-the tensor expression language.
+This is an introductory tutorial to the Tensor Expression language in TVM. TVM
+uses a domain-specific tensor expression language for efficient kernel construction. We
+will demonstrate the basic workflow with two examples of using the tensor expression
+language. The first example introduces TE and scheduling with vector
+addition. The second expands on these concepts with a step-by-step optimization
+of a matrix multiplication with TE. This matrix multiplication example will
+serve as the comparative basis for future tutorials covering more advanced
+features of TVM.
 """
+
+################################################################################
+# Example 1: Writing and Scheduling Vector Addition in TE for CPU
+# ---------------------------------------------------------------
+#
+# Let's look at an example in Python in which we will implement a schedule for
+# vector addition, targeted towards a CPU. We begin by initializing a TVM
+# environment.
+
 from __future__ import absolute_import, print_function

Review comment:
       Not sure whether this import is necessary

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Parallelism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))

Review comment:
       It's better to describe `simple_mode` and why we need it here
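
For illustration, one way the tutorial could address this (a sketch, not part of the PR; it reuses the tutorial's existing `s`, `A`, `B`, and `C` from the vector-add example):

```python
# simple_mode=True asks tvm.lower to return a compact, human-readable
# statement of the scheduled computation, skipping lowering steps that
# only matter when actually building the module. This keeps the printed
# IR short enough to compare before and after each scheduling step.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```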

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Parallelism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can then use SIMD instructions via the
+# vectorize scheduling primitive, and the outer loops can be parallelized via
+# the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs.
+
+# If you want to run this code, change ``run_cuda = True``

Review comment:
       I think we can run the CUDA tutorial in CI. There are GPU devices available, and it will not cost too much time (unlike auto-tuning).
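
One way to make the CUDA path CI-friendly, sketched here as a suggestion rather than as part of the PR (it assumes the `tvm.context(...).exist` check available in this version of TVM):

```python
# Sketch: enable the CUDA example automatically when a GPU is present,
# so CI nodes with devices exercise it and CPU-only nodes skip it.
run_cuda = tvm.context("cuda", 0).exist
```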

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -37,123 +60,82 @@
 # Global declarations of environment.
 
 tgt_host = "llvm"
-# Change it to respective GPU if gpu is enabled Ex: cuda, opencl, rocm
-tgt = "cuda"
 
-######################################################################
-# Vector Add Example
-# ------------------
-# In this tutorial, we will use a vector addition example to demonstrate
-# the workflow.
-#
+# You will get better performance if you can identify the CPU you are targeting and specify it.
+# For example, ``tgt = "llvm -mcpu=broadwell"``
+tgt = "llvm"
 
 ######################################################################
-# Describe the Computation
-# ------------------------
-# As a first step, we need to describe our computation.
-# TVM adopts tensor semantics, with each intermediate result
-# represented as a multi-dimensional array. The user needs to describe
-# the computation rule that generates the tensors.
-#
-# We first define a symbolic variable n to represent the shape.
-# We then define two placeholder Tensors, A and B, with given shape (n,)
-#
-# We then describe the result tensor C, with a compute operation.  The
-# compute function takes the shape of the tensor, as well as a lambda
-# function that describes the computation rule for each position of
-# the tensor.
-#
-# No computation happens during this phase, as we are only declaring how
-# the computation should be done.
-#
+# Describing the Vector Computation
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# We describe a vector addition computation. TVM adopts tensor semantics, with
+# each intermediate result represented as a multi-dimensional array. The user
+# needs to describe the computation rule that generates the tensors. We first
+# define a symbolic variable n to represent the shape. We then define two
+# placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then
+# describe the result tensor ``C``, with a ``compute`` operation. The
+# ``compute`` defines a computation, with the output conforming to the
+# specified tensor shape and the computation to be performed at each position
+# in the tensor defined by the lambda function. Note that while ``n`` is a
+# variable, it defines a consistent shape between the ``A``, ``B`` and ``C``
+# tensors. Remember, no actual computation happens during this phase, as we
+# are only declaring how the computation should be done.
+
 n = te.var("n")
 A = te.placeholder((n,), name="A")
 B = te.placeholder((n,), name="B")
 C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

Review comment:
       Better to explain what `lambda i` means
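
For illustration, a sketch of the kind of explanation that could be added (the `add_elementwise` helper below is only illustrative, not part of the PR): ``lambda i: A[i] + B[i]`` tells `te.compute` how to produce element ``i`` of the output, i.e. ``C[i] = A[i] + B[i]``, and is equivalent to passing a named function.

```python
# The lambda describes how each element of C is computed from A and B:
# for every index i of the output shape, C[i] = A[i] + B[i].
# Writing it as a named function (illustrative only) makes this explicit:
def add_elementwise(i):
    return A[i] + B[i]

C_named = te.compute(A.shape, add_elementwise, name="C")
```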

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -302,18 +385,452 @@
     fadd_cl(a, b, c)
     tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Summary
-# -------
-# This tutorial provides a walk through of TVM workflow using
-# a vector add example. The general workflow is
+################################################################################
+# .. note:: Code Specialization
+#
+#   As you may have noticed, the declarations of A, B and C all take the same
+#   shape argument, n. TVM will take advantage of this to pass only a single
+#   shape argument to the kernel, as you will find in the printed device code.
+#   This is one form of specialization.
+#
+#   On the host side, TVM will automatically generate check code that checks
+#   the constraints in the parameters. So if you pass arrays with different
+#   shapes into fadd, an error will be raised.
+#
+#   We can do more specializations. For example, we can write :code:`n =
+#   tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`, in the
+#   computation declaration. The generated function will only take vectors with
+#   length 1024.
+
+################################################################################
+# .. note:: TE Scheduling Primitives
+#
+#   TVM includes a number of different scheduling primitives:
+#
+#   - split: splits a specified axis into two axes by the defined factor.
+#   - tile: splits a computation across two axes by the defined factors.
+#   - fuse: fuses two consecutive axes of one computation.
+#   - reorder: reorders the axes of a computation into a defined order.
+#   - bind: can bind a computation to a specific thread, useful in GPU programming.
+#   - compute_at: by default, TVM will compute tensors at the outermost level
+#     of the function, or the root. compute_at specifies that one tensor
+#     should be computed at the first axis of computation for another
+#     operator.
+#   - compute_inline: when marked inline, a computation will be expanded then
+#     inserted into the address where the tensor is required.
+#   - compute_root: moves a computation to the outermost layer, or root, of the
+#     function. This means that stage of the computation will be fully computed
+#     before it moves on to the next stage.
+#
+#   A complete description of these primitives can be found in the
+#   `Schedule Primitives <https://tvm.apache.org/docs/tutorials/language/schedule_primitives.html>`_ docs page.
+
+################################################################################
+# Example 2: Manually Optimizing Matrix Multiplication with TE
+# ------------------------------------------------------------
+#
+# Now we will consider a second, more advanced example, demonstrating how with
+# just 18 lines of python code TVM speeds up a common matrix multiplication operation by 18x.
+#
+# **Matrix multiplication is a compute-intensive operation. There are two important optimizations for good CPU performance:**
+# 1. Increase the cache hit rate of memory access. Both complex numerical
+#    computation and hot-spot memory access can be accelerated by a high cache hit
+#    rate. This requires us to transform the original memory access pattern to a pattern that fits the cache policy.
+# 2. SIMD (single instruction, multiple data), also known as the vector processing
+#    unit. On each cycle, instead of processing a single value, SIMD can process a small batch of data.
+#    This requires us to transform the data access pattern in the loop
+#    body into a uniform pattern so that the LLVM backend can lower it to SIMD.
+#
+# The techniques used in this tutorial are a subset of tricks mentioned in this
+# `repository <https://github.com/flame/how-to-optimize-gemm>`_. Some of them
+# have been applied by TVM abstraction automatically, but some of them cannot
+# be automatically applied due to TVM constraints.
+#
+# All the experiment results mentioned below were executed on a 2015 15" MacBook
+# equipped with an Intel i7-4770HQ CPU. The cache line size should be 64 bytes for

Review comment:
       It's better to update the hardware, since it is a 6-year-old mobile CPU.

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Parallelism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can then use SIMD instructions via the
+# vectorize scheduling primitive, and the outer loops can be parallelized via
+# the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs.
+
+# If you want to run this code, change ``run_cuda = True``
+run_cuda = False
+if run_cuda:
+
+    # Change this target to the correct backend for your GPU. For example: cuda (NVIDIA GPUs),
+    # rocm (Radeon GPUs), or opencl (OpenCL).
+    tgt_gpu = "cuda"
+
+    # Recreate the schedule
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.placeholder((n,), name="B")
+    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+    print(type(C))

Review comment:
       No need to print the type here again.

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Parallelism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())

Review comment:
       Can we compare the running time with the naive schedule and with numpy?
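
A sketch of what such a comparison could look like (the `evaluate_addition` helper and `number=10` are illustrative choices, not part of the PR; it reuses the tutorial's `a`, `b`, `c`, `fadd`, `fadd_parallel`, and `tgt`):

```python
import timeit

# Illustrative helper: time a built TVM function on the tutorial's inputs
# using the runtime's time_evaluator and report the mean over 10 runs.
def evaluate_addition(func, target, name):
    ctx = tvm.context(target, 0)
    evaluator = func.time_evaluator(func.entry_name, ctx, number=10)
    print("%s: %f s" % (name, evaluator(a, b, c).mean))

evaluate_addition(fadd, tgt, "naive")
evaluate_addition(fadd_parallel, tgt, "parallel")

# numpy baseline on host copies of the same data.
a_np, b_np = a.asnumpy(), b.asnumpy()
print("numpy: %f s" % (timeit.timeit(lambda: a_np + b_np, number=10) / 10))
```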

##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -163,52 +145,156 @@
 fadd(a, b, c)
 tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Inspect the Generated Code
-# --------------------------
-# You can inspect the generated code in TVM. The result of tvm.build
-# is a TVM Module. fadd is the host module that contains the host wrapper,
-# it also contains a device module for the CUDA (GPU) function.
-#
-# The following code fetches the device module and prints the content code.
-#
-if tgt == "cuda" or tgt == "rocm" or tgt.startswith("opencl"):
-    dev_module = fadd.imported_modules[0]
-    print("-----GPU code-----")
-    print(dev_module.get_source())
-else:
-    print(fadd.get_source())
+################################################################################
+# Updating the Schedule to Use Parallelism
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# Now that we've illustrated the fundamentals of TE, let's go deeper into what
+# schedules do, and how they can be used to optimize tensor expressions for
+# different architectures. A schedule is a series of steps that are applied to
+# an expression to transform it in a number of different ways. When a schedule
+# is applied to an expression in TE, the inputs and outputs remain the same,
+# but when compiled the implementation of the expression can change. This
+# tensor addition, in the default schedule, is run serially but is easy to
+# parallelize across all of the processor threads. We can apply the parallel
+# schedule operation to our computation.
 
-######################################################################
-# .. note:: Code Specialization
-#
-#   As you may have noticed, the declarations of A, B and C all
-#   take the same shape argument, n. TVM will take advantage of this
-#   to pass only a single shape argument to the kernel, as you will find in
-#   the printed device code. This is one form of specialization.
-#
-#   On the host side, TVM will automatically generate check code
-#   that checks the constraints in the parameters. So if you pass
-#   arrays with different shapes into fadd, an error will be raised.
-#
-#   We can do more specializations. For example, we can write
-#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`,
-#   in the computation declaration. The generated function will
-#   only take vectors with length 1024.
-#
+s[C].parallel(C.op.axis[0])
 
-######################################################################
-# Save Compiled Module
-# --------------------
-# Besides runtime compilation, we can save the compiled modules into
-# a file and load them back later. This is called ahead of time compilation.
+################################################################################
+# The ``tvm.lower`` command will generate the Intermediate Representation (IR)
+# of the TE, with the corresponding schedule. By lowering the expression as we
+# apply different schedule operations, we can see the effect of scheduling on
+# the ordering of the computation.
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# It's now possible for TVM to run these blocks on independent threads. Let's
+# compile and run this new schedule with the parallel operation applied:
+
+fadd_parallel = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd_parallel")
+fadd_parallel(a, b, c)
+
+tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
+
+################################################################################
+# Updating the Schedule to Use Vectorization
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Modern CPUs also have the ability to perform SIMD operations on floating
+# point values, and we can apply another schedule to our computation expression
+# to take advantage of this. Accomplishing this requires multiple steps: first
+# we have to split the schedule into inner and outer loops using the split
+# scheduling primitive. The inner loops can then use SIMD instructions via the
+# vectorize scheduling primitive, and the outer loops can be parallelized via
+# the parallel scheduling primitive. Choose the split
+# factor to be the number of threads on your CPU.
+
+# Recreate the schedule, since we modified it with the parallel operation in the previous example
+n = te.var("n")
+A = te.placeholder((n,), name="A")
+B = te.placeholder((n,), name="B")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+
+s = te.create_schedule(C.op)
+
+factor = 4
+
+outer, inner = s[C].split(C.op.axis[0], factor=factor)
+s[C].parallel(outer)
+s[C].vectorize(inner)
+
+print(tvm.lower(s, [A, B, C], simple_mode=True))
+
+################################################################################
+# We've defined, scheduled, and compiled a vector addition operator, which we
+# were then able to execute on the TVM runtime. We can save the operator as a
+# library, which we can then load later using the TVM runtime.
+
+################################################################################
+# Targeting Vector Addition for GPUs (Optional)
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# TVM is capable of targeting multiple architectures. In the next example, we
+# will target compilation of the vector addition to GPUs.
+
+# If you want to run this code, change ``run_cuda = True``
+run_cuda = False
+if run_cuda:
+
+    # Change this target to the correct backend for your GPU. For example: cuda (NVIDIA GPUs),
+    # rocm (Radeon GPUs), or opencl (OpenCL).
+    tgt_gpu = "cuda"
+
+    # Recreate the schedule
+    n = te.var("n")
+    A = te.placeholder((n,), name="A")
+    B = te.placeholder((n,), name="B")
+    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+    print(type(C))
+
+    s = te.create_schedule(C.op)
+
+    bx, tx = s[C].split(C.op.axis[0], factor=64)
+
+    ################################################################################
+    # Finally, we bind the iteration axes bx and tx to threads in the GPU compute
+    # grid. These are GPU-specific constructs that allow us to generate code that
+    # runs on the GPU.
+
+    if tgt_gpu == "cuda" or tgt_gpu == "rocm" or tgt_gpu.startswith("opencl"):
+        s[C].bind(bx, te.thread_axis("blockIdx.x"))
+        s[C].bind(tx, te.thread_axis("threadIdx.x"))
+
+    fadd = tvm.build(s, [A, B, C], tgt_gpu, target_host=tgt_host, name="myadd")
+
+    ################################################################################
+    # The compiled TVM function exposes a concise C API that can be invoked from
+    # any language.
+    #
+    # We provide a minimal array API in python to aid quick testing and
+    # prototyping. The array API is based on the DLPack standard.
+    #
+    # We first create a GPU context. Then tvm.nd.array copies the data to the GPU,
+    # fadd runs the actual computation, and asnumpy() copies the GPU array back to the
+    # CPU (so we can verify correctness).
+
+    ctx = tvm.context(tgt_gpu, 0)
+
+    n = 1024
+    a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)

Review comment:
       Please mention that we need to copy data to devices before running
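
For reference, a minimal sketch of the point being requested (the array names here are illustrative; it reuses the tutorial's `A`, `C`, `fadd`, and its `numpy as np` import): input data starts in host numpy arrays and must be copied into device memory with `tvm.nd.array(..., ctx)` before the compiled function runs, then copied back with `asnumpy()`.

```python
# Sketch: data must be placed on the device before the kernel runs.
ctx = tvm.context("cuda", 0)
n = 1024
a_dev = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)  # host -> device
b_dev = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)
c_dev = tvm.nd.array(np.zeros(n, dtype=C.dtype), ctx)
fadd(a_dev, b_dev, c_dev)      # runs on the GPU
c_host = c_dev.asnumpy()       # device -> host, so we can inspect the result
```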




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org