Posted to commits@tvm.apache.org by tq...@apache.org on 2023/01/07 03:03:40 UTC

[tvm-site] branch asf-site updated: deploying docs (apache/tvm@30abbe98321acf594d2cd0d6b9a7c570471d9264)

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 684e64ce4e deploying docs (apache/tvm@30abbe98321acf594d2cd0d6b9a7c570471d9264)
684e64ce4e is described below

commit 684e64ce4eb87057c260a198ace0a8f5477ec02d
Author: tvm-bot <95...@users.noreply.github.com>
AuthorDate: Sat Jan 7 03:03:34 2023 +0000

    deploying docs (apache/tvm@30abbe98321acf594d2cd0d6b9a7c570471d9264)
---
 .../tune_relay_cuda.py                             |   1 +
 .../micro_pytorch.ipynb                            |  22 +-
 .../deploy_sparse.ipynb                            |   2 +-
 .../0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py |  10 +-
 .../opt_gemm.ipynb                                 |   2 +-
 .../reduction.ipynb                                |   2 +-
 .../micro_pytorch.py                               |   6 +-
 .../from_paddle.py                                 |   6 +-
 .../tune_alu_vta.ipynb                             |   2 +-
 .../deploy_prequantized_tflite.ipynb               |   2 +-
 .../intrin_math.ipynb                              |   2 +-
 .../matrix_multiply.ipynb                          |   2 +-
 .../from_pytorch.ipynb                             |  22 +-
 .../deploy_model_on_adreno.py                      |   1 +
 .../from_tflite.ipynb                              |  22 +-
 .../auto_scheduler_matmul_x86.ipynb                |   4 +-
 .../tune_sparse_x86.ipynb                          |   2 +-
 .../2a0982f8ca0176cb17713d28286536e4/reduction.py  |   1 +
 .../2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb |   2 +-
 .../from_oneflow.ipynb                             |  22 +-
 .../autotvm_relay_x86.ipynb                        |  10 +-
 .../micro_tflite.py                                | 128 +---
 .../autotvm_matmul_x86.ipynb                       |   4 +-
 .../deploy_object_detection_pytorch.ipynb          |   4 +-
 .../3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py |   1 +
 .../from_coreml.py                                 |   7 +-
 .../tensorize.ipynb                                |   2 +-
 .../opt_conv_cuda.py                               |   1 +
 .../relay_quick_start.ipynb                        |   2 +-
 .../deploy_model_on_nano.py                        |   1 +
 .../tensor_expr_get_started.ipynb                  |  12 +-
 .../from_mxnet.ipynb                               |  22 +-
 .../matrix_multiply_opt.ipynb                      |   2 +-
 .../tune_network_arm.ipynb                         |   2 +-
 .../micro_ethosu.ipynb                             |  10 +-
 .../micro_tflite.ipynb                             |  75 +-
 .../tensor_ir_blitz_course.py                      |   1 +
 .../tune_network_mali.ipynb                        |   2 +-
 .../tune_conv2d_layer_cuda.ipynb                   |   2 +-
 .../intro_topi.ipynb                               |   2 +-
 .../deploy_detection.ipynb                         |   2 +-
 .../tune_conv2d_cuda.py                            |   1 +
 .../tune_relay_mobile_gpu.ipynb                    |   2 +-
 .../6e0673ce1f08636c34d0b9a73ea114f7/uma.ipynb     |   2 +-
 .../micro_tvmc.ipynb                               |   2 +-
 .../729378592a96230b4f7be71b44da43a4/scan.ipynb    |   2 +-
 .../tune_conv2d_cuda.ipynb                         |   2 +-
 .../opt_conv_tensorcore.py                         |   1 +
 .../opt_conv_tensorcore.ipynb                      |   2 +-
 .../from_darknet.py                                |   5 +-
 .../deploy_object_detection_pytorch.py             |   4 +-
 .../from_onnx.ipynb                                |  22 +-
 .../deploy_model_on_rasp.ipynb                     |   2 +-
 .../micro_reference_vm.ipynb                       |   2 +-
 .../use_pass_infra.ipynb                           |   2 +-
 .../from_tensorflow.py                             |   5 +
 .../build_gcn.ipynb                                |  22 +-
 .../vta_get_started.ipynb                          |   2 +-
 .../from_tensorflow.ipynb                          |  22 +-
 .../extern_op.ipynb                                |   2 +-
 .../opt_conv_cuda.ipynb                            |   2 +-
 .../8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py       |   1 +
 .../tvmc_python.ipynb                              |   2 +-
 .../tune_relay_x86.ipynb                           |   2 +-
 .../deploy_classification.ipynb                    |   2 +-
 .../micro_autotune.py                              |  26 +-
 .../introduction.ipynb                             |   2 +-
 .../tuple_inputs.ipynb                             |   2 +-
 .../deploy_quantized.ipynb                         |   2 +-
 .../from_paddle.ipynb                              |  22 +-
 .../from_tflite.py                                 |   5 +-
 .../a7aff5918e1b86809a5bd1da8bef7229/tedd.ipynb    |   2 +-
 .../micro_train.ipynb                              |   8 +-
 .../from_coreml.ipynb                              |  22 +-
 .../tune_network_x86.ipynb                         |   2 +-
 .../tune_network_cuda.ipynb                        |   2 +-
 .../bring_your_own_datatypes.ipynb                 |   4 +-
 .../tune_relay_vta.ipynb                           |   2 +-
 .../convolution_opt.ipynb                          |   2 +-
 .../micro_train.py                                 |  15 +-
 .../schedule_primitives.ipynb                      |   2 +-
 .../using_relay_viz.ipynb                          |  22 +-
 .../deploy_model_on_adreno.ipynb                   |  22 +-
 .../tune_relay_arm.ipynb                           |   2 +-
 .../micro_aot.ipynb                                |  80 +-
 .../deploy_prequantized.ipynb                      |   2 +-
 .../c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py |   7 +-
 .../from_keras.ipynb                               |  22 +-
 .../tensor_ir_blitz_course.ipynb                   |   2 +-
 .../deploy_model_on_nano.ipynb                     |   2 +-
 .../using_relay_viz.py                             |   7 +
 .../relay_quick_start.py                           |   1 +
 .../tune_relay_cuda.ipynb                          |   2 +-
 .../low_level_custom_pass.ipynb                    |   2 +-
 .../deploy_ssd_gluoncv.ipynb                       |   2 +-
 .../dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py  |   8 +-
 .../tune_conv2d_layer_cuda.py                      |   1 +
 .../eb551cfff8900ec35fae9f15aa728e45/from_onnx.py  |   9 +-
 .../using_external_lib.ipynb                       |   2 +-
 .../bring_your_own_datatypes.py                    |   2 +-
 .../deploy_model_on_android.ipynb                  |   2 +-
 .../tvmc_command_line_driver.ipynb                 |  14 +-
 .../cross_compilation_and_rpc.ipynb                |   2 +-
 .../using_pipeline_executor.ipynb                  |   2 +-
 .../use_pass_instrument.ipynb                      |   2 +-
 .../from_oneflow.py                                |   4 +-
 .../micro_autotune.ipynb                           |  90 ++-
 .../f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py  |  31 +-
 .../from_pytorch.py                                |  12 +-
 .../from_darknet.ipynb                             |  15 +-
 docs/_images/sphx_glr_micro_train_001.png          | Bin 324292 -> 335230 bytes
 docs/_images/sphx_glr_micro_train_thumb.png        | Bin 23851 -> 23974 bytes
 .../how_to/compile_models/from_coreml.rst.txt      |  45 +-
 .../how_to/compile_models/from_darknet.rst.txt     |  44 +-
 .../how_to/compile_models/from_keras.rst.txt       |  44 +-
 .../how_to/compile_models/from_mxnet.rst.txt       |  53 +-
 .../how_to/compile_models/from_oneflow.rst.txt     |  47 +-
 .../how_to/compile_models/from_onnx.rst.txt        |  53 +-
 .../how_to/compile_models/from_paddle.rst.txt      |  22 +-
 .../how_to/compile_models/from_pytorch.rst.txt     |  30 +-
 .../how_to/compile_models/from_tensorflow.rst.txt  |  61 +-
 .../how_to/compile_models/from_tflite.rst.txt      |  45 +-
 docs/_sources/how_to/compile_models/index.rst.txt  |   6 +-
 .../compile_models/sg_execution_times.rst.txt      |  22 +-
 .../deploy_models/deploy_model_on_adreno.rst.txt   |  61 +-
 .../deploy_models/deploy_model_on_android.rst.txt  |  18 +-
 .../deploy_models/deploy_model_on_nano.rst.txt     |  50 +-
 .../deploy_models/deploy_model_on_rasp.rst.txt     |  16 +-
 .../deploy_object_detection_pytorch.rst.txt        |  24 +-
 .../deploy_models/deploy_prequantized.rst.txt      |  22 +-
 .../deploy_prequantized_tflite.rst.txt             |  20 +-
 .../how_to/deploy_models/deploy_quantized.rst.txt  |  18 +-
 .../how_to/deploy_models/deploy_sparse.rst.txt     |  16 +-
 .../deploy_models/deploy_ssd_gluoncv.rst.txt       |  20 +-
 .../deploy_models/sg_execution_times.rst.txt       |  20 +-
 .../extend_tvm/bring_your_own_datatypes.rst.txt    |  20 +-
 .../extend_tvm/low_level_custom_pass.rst.txt       |  16 +-
 .../how_to/extend_tvm/sg_execution_times.rst.txt   |   8 +-
 .../how_to/extend_tvm/use_pass_infra.rst.txt       |  16 +-
 .../how_to/extend_tvm/use_pass_instrument.rst.txt  |  32 +-
 .../optimize_operators/opt_conv_cuda.rst.txt       |  42 +-
 .../optimize_operators/opt_conv_tensorcore.rst.txt |  40 +-
 .../how_to/optimize_operators/opt_gemm.rst.txt     |  32 +-
 .../optimize_operators/sg_execution_times.rst.txt  |   8 +-
 .../sg_execution_times.rst.txt                     |  14 +-
 .../tune_conv2d_layer_cuda.rst.txt                 | 808 +++++++--------------
 .../tune_network_arm.rst.txt                       |  16 +-
 .../tune_network_cuda.rst.txt                      |  20 +-
 .../tune_network_mali.rst.txt                      |  16 +-
 .../tune_network_x86.rst.txt                       |  20 +-
 .../tune_sparse_x86.rst.txt                        |  44 +-
 .../tune_with_autotvm/sg_execution_times.rst.txt   |  10 +-
 .../tune_with_autotvm/tune_conv2d_cuda.rst.txt     | 721 ++++++++++++++++--
 .../tune_with_autotvm/tune_relay_arm.rst.txt       |  16 +-
 .../tune_with_autotvm/tune_relay_cuda.rst.txt      |  44 +-
 .../tune_relay_mobile_gpu.rst.txt                  |  16 +-
 .../tune_with_autotvm/tune_relay_x86.rst.txt       |  16 +-
 .../how_to/work_with_microtvm/micro_aot.rst.txt    |  82 ++-
 .../work_with_microtvm/micro_autotune.rst.txt      | 101 ++-
 .../how_to/work_with_microtvm/micro_ethosu.rst.txt |  16 +-
 .../work_with_microtvm/micro_pytorch.rst.txt       |  52 +-
 .../work_with_microtvm/micro_reference_vm.rst.txt  |  16 +-
 .../how_to/work_with_microtvm/micro_tflite.rst.txt | 148 ++--
 .../how_to/work_with_microtvm/micro_train.rst.txt  | 109 ++-
 .../how_to/work_with_microtvm/micro_tvmc.rst.txt   |  16 +-
 .../work_with_microtvm/sg_execution_times.rst.txt  |  12 +-
 .../how_to/work_with_relay/build_gcn.rst.txt       |  64 +-
 .../work_with_relay/sg_execution_times.rst.txt     |   8 +-
 .../work_with_relay/using_external_lib.rst.txt     |  16 +-
 .../using_pipeline_executor.rst.txt                |  16 +-
 .../how_to/work_with_relay/using_relay_viz.rst.txt |  47 +-
 .../how_to/work_with_schedules/extern_op.rst.txt   |  16 +-
 .../how_to/work_with_schedules/intrin_math.rst.txt |  18 +-
 .../how_to/work_with_schedules/reduction.rst.txt   |  60 +-
 .../how_to/work_with_schedules/scan.rst.txt        |  42 +-
 .../schedule_primitives.rst.txt                    |  16 +-
 .../work_with_schedules/sg_execution_times.rst.txt |  14 +-
 .../how_to/work_with_schedules/tedd.rst.txt        |  16 +-
 .../how_to/work_with_schedules/tensorize.rst.txt   |  18 +-
 .../work_with_schedules/tuple_inputs.rst.txt       |  16 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |   4 +-
 .../vta/tutorials/autotvm/tune_alu_vta.rst.txt     |  16 +-
 .../vta/tutorials/autotvm/tune_relay_vta.rst.txt   |  16 +-
 .../frontend/deploy_classification.rst.txt         |  18 +-
 .../tutorials/frontend/deploy_detection.rst.txt    |  18 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |   6 +-
 .../topic/vta/tutorials/matrix_multiply.rst.txt    |  16 +-
 .../vta/tutorials/optimize/convolution_opt.rst.txt |  16 +-
 .../tutorials/optimize/matrix_multiply_opt.rst.txt |  16 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |   6 +-
 .../topic/vta/tutorials/sg_execution_times.rst.txt |   6 +-
 .../topic/vta/tutorials/vta_get_started.rst.txt    |  16 +-
 .../tutorial/auto_scheduler_matmul_x86.rst.txt     |  29 +-
 docs/_sources/tutorial/autotvm_matmul_x86.rst.txt  |  36 +-
 docs/_sources/tutorial/autotvm_relay_x86.rst.txt   |  70 +-
 .../tutorial/cross_compilation_and_rpc.rst.txt     |  18 +-
 docs/_sources/tutorial/install.rst.txt             |  16 +-
 docs/_sources/tutorial/intro_topi.rst.txt          |  58 +-
 docs/_sources/tutorial/introduction.rst.txt        |  16 +-
 docs/_sources/tutorial/relay_quick_start.rst.txt   |  38 +-
 docs/_sources/tutorial/sg_execution_times.rst.txt  |  26 +-
 .../tutorial/tensor_expr_get_started.rst.txt       |  61 +-
 .../tutorial/tensor_ir_blitz_course.rst.txt        |  54 +-
 .../tutorial/tvmc_command_line_driver.rst.txt      |  16 +-
 docs/_sources/tutorial/tvmc_python.rst.txt         |  16 +-
 docs/_sources/tutorial/uma.rst.txt                 |  16 +-
 docs/commit_hash                                   |   2 +-
 docs/how_to/compile_models/from_coreml.html        |  10 +-
 docs/how_to/compile_models/from_darknet.html       |  10 +-
 docs/how_to/compile_models/from_keras.html         |  12 +-
 docs/how_to/compile_models/from_mxnet.html         |  13 +-
 docs/how_to/compile_models/from_oneflow.html       |  19 +-
 docs/how_to/compile_models/from_onnx.html          |  16 +-
 docs/how_to/compile_models/from_paddle.html        |  10 +-
 docs/how_to/compile_models/from_pytorch.html       |  23 +-
 docs/how_to/compile_models/from_tensorflow.html    |  10 +-
 docs/how_to/compile_models/from_tflite.html        |   8 +-
 docs/how_to/compile_models/index.html              |   6 +-
 docs/how_to/compile_models/sg_execution_times.html |  22 +-
 .../deploy_models/deploy_model_on_adreno.html      |   9 +-
 .../deploy_models/deploy_model_on_android.html     |   7 +-
 .../how_to/deploy_models/deploy_model_on_nano.html |   5 +-
 .../how_to/deploy_models/deploy_model_on_rasp.html |   5 +-
 .../deploy_object_detection_pytorch.html           |  46 +-
 docs/how_to/deploy_models/deploy_prequantized.html |  12 +-
 .../deploy_models/deploy_prequantized_tflite.html  |   9 +-
 docs/how_to/deploy_models/deploy_quantized.html    |   7 +-
 docs/how_to/deploy_models/deploy_sparse.html       |   5 +-
 docs/how_to/deploy_models/deploy_ssd_gluoncv.html  |  42 +-
 docs/how_to/deploy_models/sg_execution_times.html  |  20 +-
 .../extend_tvm/bring_your_own_datatypes.html       |   7 +-
 docs/how_to/extend_tvm/low_level_custom_pass.html  |   5 +-
 docs/how_to/extend_tvm/sg_execution_times.html     |   8 +-
 docs/how_to/extend_tvm/use_pass_infra.html         |   5 +-
 docs/how_to/extend_tvm/use_pass_instrument.html    |  21 +-
 docs/how_to/optimize_operators/opt_conv_cuda.html  |   7 +-
 .../optimize_operators/opt_conv_tensorcore.html    |   7 +-
 docs/how_to/optimize_operators/opt_gemm.html       |  21 +-
 .../optimize_operators/sg_execution_times.html     |   8 +-
 .../sg_execution_times.html                        |  14 +-
 .../tune_conv2d_layer_cuda.html                    | 763 +++++++------------
 .../tune_with_autoscheduler/tune_network_arm.html  |   5 +-
 .../tune_with_autoscheduler/tune_network_cuda.html |   9 +-
 .../tune_with_autoscheduler/tune_network_mali.html |   5 +-
 .../tune_with_autoscheduler/tune_network_x86.html  |   9 +-
 .../tune_with_autoscheduler/tune_sparse_x86.html   |  33 +-
 .../tune_with_autotvm/sg_execution_times.html      |  10 +-
 .../how_to/tune_with_autotvm/tune_conv2d_cuda.html | 698 +++++++++++++++++-
 docs/how_to/tune_with_autotvm/tune_relay_arm.html  |   5 +-
 docs/how_to/tune_with_autotvm/tune_relay_cuda.html |   5 +-
 .../tune_with_autotvm/tune_relay_mobile_gpu.html   |   5 +-
 docs/how_to/tune_with_autotvm/tune_relay_x86.html  |   5 +-
 docs/how_to/work_with_microtvm/micro_aot.html      |  88 ++-
 docs/how_to/work_with_microtvm/micro_autotune.html | 122 +++-
 docs/how_to/work_with_microtvm/micro_ethosu.html   |   5 +-
 docs/how_to/work_with_microtvm/micro_pytorch.html  |  23 +-
 .../work_with_microtvm/micro_reference_vm.html     |   5 +-
 docs/how_to/work_with_microtvm/micro_tflite.html   | 157 ++--
 docs/how_to/work_with_microtvm/micro_train.html    |  41 +-
 docs/how_to/work_with_microtvm/micro_tvmc.html     |   5 +-
 .../work_with_microtvm/sg_execution_times.html     |  12 +-
 docs/how_to/work_with_relay/build_gcn.html         |  11 +-
 .../how_to/work_with_relay/sg_execution_times.html |   8 +-
 .../how_to/work_with_relay/using_external_lib.html |   5 +-
 .../work_with_relay/using_pipeline_executor.html   |   5 +-
 docs/how_to/work_with_relay/using_relay_viz.html   |  11 +-
 docs/how_to/work_with_schedules/extern_op.html     |   5 +-
 docs/how_to/work_with_schedules/intrin_math.html   |   7 +-
 docs/how_to/work_with_schedules/reduction.html     |   5 +-
 docs/how_to/work_with_schedules/scan.html          |   5 +-
 .../work_with_schedules/schedule_primitives.html   |   5 +-
 .../work_with_schedules/sg_execution_times.html    |  14 +-
 docs/how_to/work_with_schedules/tedd.html          |   5 +-
 docs/how_to/work_with_schedules/tensorize.html     |   7 +-
 docs/how_to/work_with_schedules/tuple_inputs.html  |   5 +-
 docs/install/nnpack.html                           |  12 +-
 docs/reference/api/python/auto_scheduler.html      |   4 +-
 .../api/typedoc/classes/bytestreamreader.html      |  12 +-
 .../api/typedoc/classes/cachedcallstack.html       |  34 +-
 docs/reference/api/typedoc/classes/dldatatype.html |  12 +-
 docs/reference/api/typedoc/classes/dldevice.html   |  10 +-
 .../reference/api/typedoc/classes/environment.html |  12 +-
 docs/reference/api/typedoc/classes/ffilibrary.html |  20 +-
 .../api/typedoc/classes/graphexecutor.html         |  16 +-
 docs/reference/api/typedoc/classes/instance.html   |  40 +-
 docs/reference/api/typedoc/classes/memory.html     |  34 +-
 docs/reference/api/typedoc/classes/module.html     |  10 +-
 docs/reference/api/typedoc/classes/ndarray.html    |  22 +-
 .../api/typedoc/classes/packedfunccell.html        |   6 +-
 docs/reference/api/typedoc/classes/rpcserver.html  |  14 +-
 docs/reference/api/typedoc/classes/scalar.html     |   6 +-
 .../api/typedoc/classes/webgpucontext.html         |  12 +-
 docs/reference/api/typedoc/enums/argtypecode.html  |  30 +-
 .../api/typedoc/enums/aynccallbackcode.html        |   4 +-
 .../api/typedoc/enums/dldatatypecode.html          |   8 +-
 .../api/typedoc/enums/rpcserverstate.html          |  12 +-
 docs/reference/api/typedoc/enums/sizeof.html       |  18 +-
 docs/reference/api/typedoc/index.html              | 112 +--
 .../api/typedoc/interfaces/disposable.html         |   2 +-
 .../api/typedoc/interfaces/functioninfo.html       |   6 +-
 .../api/typedoc/interfaces/libraryprovider.html    |   4 +-
 docs/searchindex.js                                |   2 +-
 .../vta/tutorials/autotvm/sg_execution_times.html  |   4 +-
 docs/topic/vta/tutorials/autotvm/tune_alu_vta.html |   5 +-
 .../vta/tutorials/autotvm/tune_relay_vta.html      |   5 +-
 .../tutorials/frontend/deploy_classification.html  |   7 +-
 .../vta/tutorials/frontend/deploy_detection.html   |   7 +-
 .../vta/tutorials/frontend/sg_execution_times.html |   6 +-
 docs/topic/vta/tutorials/matrix_multiply.html      |   5 +-
 .../vta/tutorials/optimize/convolution_opt.html    |   5 +-
 .../tutorials/optimize/matrix_multiply_opt.html    |   5 +-
 .../vta/tutorials/optimize/sg_execution_times.html |   6 +-
 docs/topic/vta/tutorials/sg_execution_times.html   |   6 +-
 docs/topic/vta/tutorials/vta_get_started.html      |   5 +-
 docs/tutorial/auto_scheduler_matmul_x86.html       |  13 +-
 docs/tutorial/autotvm_matmul_x86.html              |  25 +-
 docs/tutorial/autotvm_relay_x86.html               | 269 +++----
 docs/tutorial/cross_compilation_and_rpc.html       |   7 +-
 docs/tutorial/install.html                         |   5 +-
 docs/tutorial/intro_topi.html                      |   7 +-
 docs/tutorial/introduction.html                    |   5 +-
 docs/tutorial/relay_quick_start.html               |   5 +-
 docs/tutorial/sg_execution_times.html              |  26 +-
 docs/tutorial/tensor_expr_get_started.html         |  46 +-
 docs/tutorial/tensor_ir_blitz_course.html          |   5 +-
 docs/tutorial/tvmc_command_line_driver.html        |   5 +-
 docs/tutorial/tvmc_python.html                     |   5 +-
 docs/tutorial/uma.html                             |   5 +-
 328 files changed, 5279 insertions(+), 3534 deletions(-)

diff --git a/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py b/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py
index 4cf397e256..7cb6cb8dd3 100644
--- a/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py
+++ b/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py
@@ -60,6 +60,7 @@ __name__ == "__main__":` block.
 # Now return to python code. Import packages.
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
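
The hunk above adds a ``# sphinx_gallery_requires_cuda = True`` marker to this CUDA tuning
tutorial, and the same marker appears in several other GPU-only scripts in this commit. The
consumer of the marker is not shown here, but because it is a plain comment, a gallery build
script can gate CUDA-only tutorials by scanning for it. A minimal sketch of that idea (the
``requires_cuda`` helper is hypothetical, not part of TVM):

.. code-block:: python

    from pathlib import Path

    def requires_cuda(script: Path) -> bool:
        # The marker is an ordinary comment line, so a line-by-line scan is enough.
        return any(
            line.strip() == "# sphinx_gallery_requires_cuda = True"
            for line in script.read_text().splitlines()
        )

    # Skip CUDA-only tutorials when building the gallery on a CPU-only machine.
    if requires_cuda(Path("tune_relay_cuda.py")):
        print("skipping tune_relay_cuda.py: requires a CUDA-enabled build")
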
diff --git a/docs/_downloads/09df7d9b9c90a2a1bdd570520693fd9f/micro_pytorch.ipynb b/docs/_downloads/09df7d9b9c90a2a1bdd570520693fd9f/micro_pytorch.ipynb
index 32db2a435c..5a3642bdf3 100644
--- a/docs/_downloads/09df7d9b9c90a2a1bdd570520693fd9f/micro_pytorch.ipynb
+++ b/docs/_downloads/09df7d9b9c90a2a1bdd570520693fd9f/micro_pytorch.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -18,6 +18,24 @@
         "\n\n# microTVM PyTorch Tutorial\n**Authors**:\n[Mehrdad Hessar](https://github.com/mehrdadh)\n\nThis tutorial is showcasing microTVM host-driven AoT compilation with\na PyTorch model. This tutorial can be executed on a x86 CPU using C runtime (CRT).\n\n**Note:** This tutorial only runs on x86 CPU using CRT and does not run on Zephyr\nsince the model would not fit on our current supported Zephyr boards.\n"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install microTVM Python dependencies\n\nTVM does not include a package for Python serial communication, so\nwe must install one before using microTVM. We will also need TFLite\nto load models.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install pyserial==3.5 tflite==2.1"
+      ]
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -26,7 +44,7 @@
       },
       "outputs": [],
       "source": [
-        "import pathlib\n\nimport torch\nimport torchvision\nfrom torchvision import transforms\nimport numpy as np\nfrom PIL import Image\n\nimport tvm\nfrom tvm import relay\nfrom tvm.contrib.download import download_testdata\nfrom tvm.relay.backend import Executor"
+        "import pathlib\nimport torch\nimport torchvision\nfrom torchvision import transforms\nimport numpy as np\nfrom PIL import Image\n\nimport tvm\nfrom tvm import relay\nfrom tvm.contrib.download import download_testdata\nfrom tvm.relay.backend import Executor"
       ]
     },
     {
diff --git a/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb b/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
index bae6ce242b..1e73311150 100644
--- a/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
+++ b/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py b/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py
index 3808461862..cfd66ecdb7 100644
--- a/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py
+++ b/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py
@@ -22,21 +22,19 @@ Compile MXNet Models
 **Author**: `Joshua Z. Zhang <https://zhreshold.github.io/>`_, \
             `Kazutaka Morita <https://github.com/kazum>`_
 
-This article is an introductory tutorial to deploy mxnet models with Relay.
-
-For us to begin with, mxnet module is required to be installed.
-
-A quick solution is
+This article is an introductory tutorial to deploy mxnet models with Relay. To begin, we must install `mxnet`:
 
 .. code-block:: bash
 
-    pip install mxnet --user
+    %%shell
+    pip install mxnet
 
 or please refer to official installation guide.
 https://mxnet.apache.org/versions/master/install/index.html
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb b/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
index 30061793bc..0212b828d4 100644
--- a/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
+++ b/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb b/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb
index f9cefac27a..57e2871dcb 100644
--- a/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb
+++ b/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/12b9ecc04c41abaa12022061771821d1/micro_pytorch.py b/docs/_downloads/12b9ecc04c41abaa12022061771821d1/micro_pytorch.py
index cd4af05fb5..f7f0c9209a 100644
--- a/docs/_downloads/12b9ecc04c41abaa12022061771821d1/micro_pytorch.py
+++ b/docs/_downloads/12b9ecc04c41abaa12022061771821d1/micro_pytorch.py
@@ -29,6 +29,11 @@ a PyTorch model. This tutorial can be executed on a x86 CPU using C runtime (CRT
 since the model would not fit on our current supported Zephyr boards.
 """
 
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
+#
+
 # sphinx_gallery_start_ignore
 from tvm import testing
 
@@ -36,7 +41,6 @@ testing.utils.install_request_hook(depth=3)
 # sphinx_gallery_end_ignore
 
 import pathlib
-
 import torch
 import torchvision
 from torchvision import transforms
diff --git a/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py b/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py
index fecb1c48da..199547b814 100644
--- a/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py
+++ b/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py
@@ -20,14 +20,14 @@ Compile PaddlePaddle Models
 **Author**: `Ziyuan Ma <https://github.com/ZiyuanMa/>`_
 
 This article is an introductory tutorial to deploy PaddlePaddle models with Relay.
-For us to begin with, PaddlePaddle>=2.1.3 is required to be installed.
-A quick solution is
+To begin, we'll install PaddlePaddle>=2.1.3:
 
 .. code-block:: bash
 
+    %%shell
     pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
 
-or please refer to official site.
+For more details, refer to the official install instructions at:
 https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html
 """
 
diff --git a/docs/_downloads/178b6f23dffc01ac92f2cf95f41a5679/tune_alu_vta.ipynb b/docs/_downloads/178b6f23dffc01ac92f2cf95f41a5679/tune_alu_vta.ipynb
index c681b0c7f2..c0a2198da7 100644
--- a/docs/_downloads/178b6f23dffc01ac92f2cf95f41a5679/tune_alu_vta.ipynb
+++ b/docs/_downloads/178b6f23dffc01ac92f2cf95f41a5679/tune_alu_vta.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb b/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
index f7fc826ff2..93b9d89091 100644
--- a/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
+++ b/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb b/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb
index 553aef0b20..6c816142e5 100644
--- a/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb
+++ b/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/1ee0b869c5082223c5dfbb0fe4574252/matrix_multiply.ipynb b/docs/_downloads/1ee0b869c5082223c5dfbb0fe4574252/matrix_multiply.ipynb
index cb123aab32..a7938aa85d 100644
--- a/docs/_downloads/1ee0b869c5082223c5dfbb0fe4574252/matrix_multiply.ipynb
+++ b/docs/_downloads/1ee0b869c5082223c5dfbb0fe4574252/matrix_multiply.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/1f4943aed1aa607b2775c18b1d71db10/from_pytorch.ipynb b/docs/_downloads/1f4943aed1aa607b2775c18b1d71db10/from_pytorch.ipynb
index 5435282353..a8d9c39eba 100644
--- a/docs/_downloads/1f4943aed1aa607b2775c18b1d71db10/from_pytorch.ipynb
+++ b/docs/_downloads/1f4943aed1aa607b2775c18b1d71db10/from_pytorch.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile PyTorch Models\n**Author**: [Alex Wong](https://github.com/alexwong/)\n\nThis article is an introductory tutorial to deploy PyTorch models with Relay.\n\nFor us to begin with, PyTorch should be installed.\nTorchVision is also required since we will be using it as our model zoo.\n\nA quick solution is to install via pip\n\n```bash\npip install torch==1.7.0\npip install torchvision==0.8.1\n```\nor please refer to official site\nhttps://pytorch.org/get-started/locally/\ [...]
+        "\n# Compile PyTorch Models\n**Author**: [Alex Wong](https://github.com/alexwong/)\n\nThis article is an introductory tutorial to deploy PyTorch models with Relay.\n\nFor us to begin, PyTorch should be installed.\nTorchVision is also required so we can use the model zoo.\nA quick solution is to install via pip:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install torch\npip install torchvision"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "or please refer to official site\nhttps://pytorch.org/get-started/locally/\n\nPyTorch versions should be backwards compatible but should be used\nwith the proper TorchVision version.\n\nCurrently, TVM supports PyTorch 1.7 and 1.4. Other versions may\nbe unstable.\n"
       ]
     },
     {
diff --git a/docs/_downloads/2387d8448da213eb625e6b3d916327d4/deploy_model_on_adreno.py b/docs/_downloads/2387d8448da213eb625e6b3d916327d4/deploy_model_on_adreno.py
index d6ed1f1f99..8d25e50b56 100644
--- a/docs/_downloads/2387d8448da213eb625e6b3d916327d4/deploy_model_on_adreno.py
+++ b/docs/_downloads/2387d8448da213eb625e6b3d916327d4/deploy_model_on_adreno.py
@@ -31,6 +31,7 @@ A quick solution is to install it via pip:
 
 .. code-block:: bash
 
+  %%shell
   pip install torch
   pip install torchvision
 
diff --git a/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb b/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
index 940533d063..9957ace9a6 100644
--- a/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
+++ b/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile TFLite Models\n**Author**: [Zhao Wu](https://github.com/FrozenGene)\n\nThis article is an introductory tutorial to deploy TFLite models with Relay.\n\nTo get started, TFLite package needs to be installed as prerequisite.\n\n```bash\n# install tflite\npip install tflite==2.1.0 --user\n```\nor you could generate TFLite package yourself. The steps are the following:\n\n```bash\n# Get the flatc compiler.\n# Please refer to https://github.com/google/flatbuffers for detail [...]
+        "\n# Compile TFLite Models\n**Author**: [Zhao Wu](https://github.com/FrozenGene)\n\nThis article is an introductory tutorial to deploy TFLite models with Relay.\n\nTo get started, TFLite package needs to be installed as prerequisite.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install tflite==2.1.0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "or you could generate TFLite package yourself. The steps are the following:\n\n```bash\n# Get the flatc compiler.\n# Please refer to https://github.com/google/flatbuffers for details\n# and make sure it is properly installed.\nflatc --version\n\n# Get the TFLite schema.\nwget https://raw.githubusercontent.com/tensorflow/tensorflow/r1.13/tensorflow/lite/schema/schema.fbs\n\n# Generate TFLite package.\nflatc --python schema.fbs\n\n# Add current folder (which contains generated tfl [...]
       ]
     },
     {
diff --git a/docs/_downloads/246d4b8509474fd9046e69f6cc9b7f87/auto_scheduler_matmul_x86.ipynb b/docs/_downloads/246d4b8509474fd9046e69f6cc9b7f87/auto_scheduler_matmul_x86.ipynb
index 77876fbe63..3ac5f65419 100644
--- a/docs/_downloads/246d4b8509474fd9046e69f6cc9b7f87/auto_scheduler_matmul_x86.ipynb
+++ b/docs/_downloads/246d4b8509474fd9046e69f6cc9b7f87/auto_scheduler_matmul_x86.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -51,7 +51,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Create the search task\nWith the function defined, we can now create the task for the auto_scheduler\nto search against. We specify the particular parameters for this matrix\nmultiplication, in this case a multiplication of two square matrices of size\n1024x1024. We then create a search task with N=L=M=1024 and dtype=\"float32\"\n\n.. admonition:: Improve performance with custom targets\n\n  In order for TVM to take full advantage of specific hardware platforms,\n  you will w [...]
+        "## Create the search task\nWith the function defined, we can now create the task for the auto_scheduler\nto search against. We specify the particular parameters for this matrix\nmultiplication, in this case a multiplication of two square matrices of size\n1024x1024. We then create a search task with N=L=M=1024 and dtype=\"float32\"\n\n<div class=\"alert alert-info\"><h4>Improve performance with custom targets</h4><p>In order for TVM to take full advantage of specific hardware pl [...]
       ]
     },
     {
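
The notebook text above describes creating an auto_scheduler search task for a square matmul
with N=L=M=1024 and dtype="float32". As a reference for readers of this diff, a minimal sketch
of that task creation using TVM's public auto_scheduler API, assuming an LLVM host target (the
workload below is a plain matmul; the tutorial's full workload also adds a bias term):

.. code-block:: python

    import tvm
    from tvm import te, auto_scheduler

    @auto_scheduler.register_workload
    def matmul(N, L, M, dtype):
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
        return [A, B, C]

    # N = L = M = 1024, dtype = "float32", matching the tutorial text.
    task = auto_scheduler.SearchTask(
        func=matmul, args=(1024, 1024, 1024, "float32"), target=tvm.target.Target("llvm")
    )
    print(task.compute_dag)
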
diff --git a/docs/_downloads/293f8d0753933b706a0b588f909fe38a/tune_sparse_x86.ipynb b/docs/_downloads/293f8d0753933b706a0b588f909fe38a/tune_sparse_x86.ipynb
index ff09e6a5ca..a1cd6d0114 100644
--- a/docs/_downloads/293f8d0753933b706a0b588f909fe38a/tune_sparse_x86.ipynb
+++ b/docs/_downloads/293f8d0753933b706a0b588f909fe38a/tune_sparse_x86.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py b/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py
index 432e9cd143..c084c45d38 100644
--- a/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py
+++ b/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py
@@ -29,6 +29,7 @@ from __future__ import absolute_import, print_function
 
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb b/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb
index 86f13e985c..619216f49d 100644
--- a/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb
+++ b/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/2e7b51cb39c472626dd3f046d9b89966/from_oneflow.ipynb b/docs/_downloads/2e7b51cb39c472626dd3f046d9b89966/from_oneflow.ipynb
index 643a336e19..0b833200f5 100644
--- a/docs/_downloads/2e7b51cb39c472626dd3f046d9b89966/from_oneflow.ipynb
+++ b/docs/_downloads/2e7b51cb39c472626dd3f046d9b89966/from_oneflow.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile OneFlow Models\n**Author**: [Xiaoyu Zhang](https://github.com/BBuf/)\n\nThis article is an introductory tutorial to deploy OneFlow models with Relay.\n\nFor us to begin with, OneFlow package should be installed.\n\nA quick solution is to install via pip\n\n```bash\npip install flowvision==0.1.0\npython3 -m pip install -f https://release.oneflow.info oneflow==0.7.0+cpu\n```\nor please refer to official site:\nhttps://github.com/Oneflow-Inc/oneflow\n\nCurrently, TVM su [...]
+        "\n# Compile OneFlow Models\n**Author**: [Xiaoyu Zhang](https://github.com/BBuf/)\n\nThis article is an introductory tutorial to deploy OneFlow models with Relay.\n\nFor us to begin with, OneFlow package should be installed.\n\nA quick solution is to install via pip\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install flowvision==0.1.0\npip install -f https://release.oneflow.info oneflow==0.7.0+cpu"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "or please refer to official site:\nhttps://github.com/Oneflow-Inc/oneflow\n\nCurrently, TVM supports OneFlow 0.7.0. Other versions may be unstable.\n"
       ]
     },
     {
diff --git a/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb b/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb
index 1416cc838a..76a3ecae0d 100644
--- a/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb
+++ b/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -51,7 +51,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Downloading and Loading the ONNX Model\n\nFor this tutorial, we will be working with ResNet-50 v2. ResNet-50 is a\nconvolutional neural network that is 50 layers deep and designed to classify\nimages. The model we will be using has been pre-trained on more than a\nmillion images with 1000 different classifications. The network has an input\nimage size of 224x224. If you are interested exploring more of how the\nResNet-50 model is structured, we recommend downloading\n[Netron] [...]
+        "## Downloading and Loading the ONNX Model\n\nFor this tutorial, we will be working with ResNet-50 v2. ResNet-50 is a\nconvolutional neural network that is 50 layers deep and designed to classify\nimages. The model we will be using has been pre-trained on more than a\nmillion images with 1000 different classifications. The network has an input\nimage size of 224x224. If you are interested exploring more of how the\nResNet-50 model is structured, we recommend downloading\n[Netron] [...]
       ]
     },
     {
@@ -105,7 +105,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Defining the Correct Target\n\n  Specifying the correct target can have a huge impact on the performance of\n  the compiled module, as it can take advantage of hardware features\n  available on the target. For more information, please refer to\n  `Auto-tuning a convolutional network for x86 CPU <tune_relay_x86>`.\n  We recommend identifying which CPU you are running, along with optional\n  features, and set the target appropriately. For example, for some\n  proce [...]
+        "<div class=\"alert alert-info\"><h4>Defining the Correct Target</h4><p>Specifying the correct target can have a huge impact on the performance of\nthe compiled module, as it can take advantage of hardware features\navailable on the target. For more information, please refer to\n`Auto-tuning a convolutional network for x86 CPU <tune_relay_x86>`.\nWe recommend identifying which CPU you are running, along with optional\nfeatures, and set the target appropriately. For example, for s [...]
       ]
     },
     {
@@ -238,14 +238,14 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Defining the Tuning Search Algorithm\n\n  By default this search is guided using an `XGBoost Grid` algorithm.\n  Depending on your model complexity and amount of time available, you might\n  want to choose a different algorithm.\n\n"
+        "<div class=\"alert alert-info\"><h4>Defining the Tuning Search Algorithm</h4><p>By default this search is guided using an `XGBoost Grid` algorithm.\nDepending on your model complexity and amount of time available, you might\nwant to choose a different algorithm.</p></div>"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Setting Tuning Parameters\n\n  In this example, in the interest of time, we set the number of trials and\n  early stopping to 10. You will likely see more performance improvements if\n  you set these values to be higher but this comes at the expense of time\n  spent tuning. The number of trials required for convergence will vary\n  depending on the specifics of the model and the target platform.\n\n"
+        "<div class=\"alert alert-info\"><h4>Setting Tuning Parameters</h4><p>In this example, in the interest of time, we set the number of trials and\nearly stopping to 10. You will likely see more performance improvements if\nyou set these values to be higher but this comes at the expense of time\nspent tuning. The number of trials required for convergence will vary\ndepending on the specifics of the model and the target platform.</p></div>"
       ]
     },
     {
diff --git a/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py b/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py
index 5822a1a1e9..cbdf6cd6f4 100644
--- a/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py
+++ b/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py
@@ -26,101 +26,9 @@ model with Relay.
 """
 
 ######################################################################
-# .. note::
-#     If you want to run this tutorial on the microTVM Reference VM, download the Jupyter
-#     notebook using the link at the bottom of this page and save it into the TVM directory. Then:
 #
-#     #. Login to the reference VM with a modified ``vagrant ssh`` command:
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
 #
-#         ``$ vagrant ssh -- -L8888:localhost:8888``
-#
-#     #. Install jupyter:  ``pip install jupyterlab``
-#     #. ``cd`` to the TVM directory.
-#     #. Install tflite: poetry install -E importer-tflite
-#     #. Launch Jupyter Notebook: ``jupyter notebook``
-#     #. Copy the localhost URL displayed, and paste it into your browser.
-#     #. Navigate to saved Jupyter Notebook (``.ipynb`` file).
-#
-#
-# Setup
-# -----
-#
-# Install TFLite
-# ^^^^^^^^^^^^^^
-#
-# To get started, TFLite package needs to be installed as prerequisite. You can do this in two ways:
-#
-# 1. Install tflite with ``pip``
-#
-#     .. code-block:: bash
-#
-#       pip install tflite=2.1.0 --user
-#
-# 2. Generate the TFLite package yourself. The steps are the following:
-#
-#     Get the flatc compiler.
-#     Please refer to https://github.com/google/flatbuffers for details
-#     and make sure it is properly installed.
-#
-#     .. code-block:: bash
-#
-#       flatc --version
-#
-#     Get the TFLite schema.
-#
-#     .. code-block:: bash
-#
-#       wget https://raw.githubusercontent.com/tensorflow/tensorflow/r1.13/tensorflow/lite/schema/schema.fbs
-#
-#     Generate TFLite package.
-#
-#     .. code-block:: bash
-#
-#       flatc --python schema.fbs
-#
-#     Add the current folder (which contains generated tflite module) to PYTHONPATH.
-#
-#     .. code-block:: bash
-#
-#       export PYTHONPATH=${PYTHONPATH:+$PYTHONPATH:}$(pwd)
-#
-# To validate that the TFLite package was installed successfully, ``python -c "import tflite"``
-#
-# Install Zephyr (physical hardware only)
-# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-#
-# When running this tutorial with a host simulation (the default), you can use the host ``gcc`` to
-# build a firmware image that simulates the device. When compiling to run on physical hardware, you
-# need to install a *toolchain* plus some target-specific dependencies. microTVM allows you to
-# supply any compiler and runtime that can launch the TVM RPC server, but to get started, this
-# tutorial relies on the Zephyr RTOS to provide these pieces.
-#
-# You can install Zephyr by following the
-# `Installation Instructions <https://docs.zephyrproject.org/latest/getting_started/index.html>`_.
-#
-# Aside: Recreating your own Pre-Trained TFLite model
-#  The tutorial downloads a pretrained TFLite model. When working with microcontrollers
-#  you need to be mindful these are highly resource constrained devices as such standard
-#  models like MobileNet may not fit into their modest memory.
-#
-#  For this tutorial, we'll make use of one of the TF Micro example models.
-#
-#  If you wish to replicate the training steps see:
-#  https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/hello_world/train
-#
-#    .. note::
-#
-#      If you accidentally download the example pretrained model from:
-#
-#      ``wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/micro/hello_world_2020_04_13.zip``
-#
-#      this will fail due to an unimplemented opcode (114)
-#
-# Load and prepare the Pre-Trained Model
-# --------------------------------------
-#
-# Load the pretrained TFLite model from a file in your current
-# directory into a buffer
 
 # sphinx_gallery_start_ignore
 from tvm import testing
@@ -129,6 +37,27 @@ testing.utils.install_request_hook(depth=3)
 # sphinx_gallery_end_ignore
 
 import os
+
+# By default, this tutorial runs on x86 CPU using TVM's C runtime. If you would like
+# to run on real Zephyr hardware, you must export the `TVM_MICRO_USE_HW` environment
+# variable. Otherwise (if you are using the C runtime), you can skip installing
+# Zephyr and CMSIS-NN. It takes ~20 minutes to install both of them.
+use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
+
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_zephyr.rst
+#
+
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_cmsis.rst
+#
+
+######################################################################
+# Import Python dependencies
+# -------------------------------
+#
 import json
 import tarfile
 import pathlib
@@ -140,7 +69,6 @@ from tvm import relay
 import tvm.contrib.utils
 from tvm.contrib.download import download_testdata
 
-use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
 model_url = "https://people.linaro.org/~tom.gall/sine_model.tflite"
 model_file = "sine_model.tflite"
 model_path = download_testdata(model_url, model_file, module="data")
@@ -207,8 +135,7 @@ if use_physical_hw:
     boards_file = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr")) / "boards.json"
     with open(boards_file) as f:
         boards = json.load(f)
-
-    BOARD = os.getenv("TVM_MICRO_BOARD", default="nucleo_f746zg")
+    BOARD = os.getenv("TVM_MICRO_BOARD", default="nucleo_l4r5zi")
     SERIAL = os.getenv("TVM_MICRO_SERIAL", default=None)
     TARGET = tvm.target.target.micro(boards[BOARD]["model"])
 
@@ -292,7 +219,14 @@ project_options = {}  # You can use options to provide platform-specific options
 
 if use_physical_hw:
     template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr"))
-    project_options = {"project_type": "host_driven", "board": BOARD, "serial_number": SERIAL}
+    project_options = {
+        "project_type": "host_driven",
+        "board": BOARD,
+        "serial_number": SERIAL,
+        "config_main_stack_size": 4096,
+        "cmsis_path": os.getenv("CMSIS_PATH", default="/content/cmsis"),
+        "zephyr_base": os.getenv("ZEPHYR_BASE", default="/content/zephyrproject/zephyr"),
+    }
 
 # Create a temporary directory
 
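The micro_tflite.py changes above hinge on a single toggle: exporting ``TVM_MICRO_USE_HW``
selects a physical Zephyr board (default now ``nucleo_l4r5zi``) instead of the host-simulated
C runtime. A condensed sketch of the board-to-target lookup the tutorial performs, assuming a
TVM build with microTVM support:

.. code-block:: python

    import json
    import os
    import pathlib

    import tvm

    # Export TVM_MICRO_USE_HW to target real Zephyr hardware; leave it unset for host simulation.
    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))

    if use_physical_hw:
        # boards.json ships with the Zephyr template project and maps board names to SoC models.
        boards_file = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr")) / "boards.json"
        with open(boards_file) as f:
            boards = json.load(f)
        board = os.getenv("TVM_MICRO_BOARD", default="nucleo_l4r5zi")
        target = tvm.target.target.micro(boards[board]["model"])
    else:
        # Host-driven execution with the C runtime (CRT), as in the default tutorial path.
        target = tvm.target.target.micro("host")
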
diff --git a/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb b/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
index 37040c308e..eca8a86119 100644
--- a/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
+++ b/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -108,7 +108,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: More Explanation on :code:`cfg.define_split`\n\n In this template, :code:`cfg.define_split(\"tile_y\", y, num_outputs=2)` will\n enumerate all possible combinations that can split axis y into two axes with\n factors of the length of y. For example, if the length of y is 32 and we\n want to split it into two axes using factors of 32, then there are 6\n possible values for (length of outer axis, length of inner axis) pair,\n namely (32, 1), (16, 2), (8, 4), (4, 8), [...]
+        "<div class=\"alert alert-info\"><h4>More Explanation on :code:`cfg.define_split`</h4><p></p></div> In this template, :code:`cfg.define_split(\"tile_y\", y, num_outputs=2)` will\n enumerate all possible combinations that can split axis y into two axes with\n factors of the length of y. For example, if the length of y is 32 and we\n want to split it into two axes using factors of 32, then there are 6\n possible values for (length of outer axis, length of inner axis) pair,\n namely [...]
       ]
     },
     {
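
The admonition rewritten above explains that ``cfg.define_split("tile_y", y, num_outputs=2)``
enumerates every way to split axis y into two factors of its length, giving six
(outer, inner) pairs when the length of y is 32. A quick standalone check of that count:

.. code-block:: python

    # Enumerate the (outer, inner) factor pairs define_split would consider for y of length 32.
    y_len = 32
    pairs = [(y_len // inner, inner) for inner in range(1, y_len + 1) if y_len % inner == 0]
    print(pairs)       # [(32, 1), (16, 2), (8, 4), (4, 8), (2, 16), (1, 32)]
    print(len(pairs))  # 6, matching the tutorial text
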
diff --git a/docs/_downloads/399e1d7889ca66b69d51655784827503/deploy_object_detection_pytorch.ipynb b/docs/_downloads/399e1d7889ca66b69d51655784827503/deploy_object_detection_pytorch.ipynb
index 5f40a6b503..20fd2ef2d0 100644
--- a/docs/_downloads/399e1d7889ca66b69d51655784827503/deploy_object_detection_pytorch.ipynb
+++ b/docs/_downloads/399e1d7889ca66b69d51655784827503/deploy_object_detection_pytorch.ipynb
@@ -8,14 +8,14 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile PyTorch Object Detection Models\nThis article is an introductory tutorial to deploy PyTorch object\ndetection models with Relay VM.\n\nFor us to begin with, PyTorch should be installed.\nTorchVision is also required since we will be using it as our model zoo.\n\nA quick solution is to install via pip\n\n```bash\npip install torch==1.7.0\npip install torchvision==0.8.1\n```\nor please refer to official site\nhttps://pytorch.org/get-started/locally/\n\nPyTorch versions [...]
+        "\n# Compile PyTorch Object Detection Models\nThis article is an introductory tutorial to deploy PyTorch object\ndetection models with Relay VM.\n\nFor us to begin with, PyTorch should be installed.\nTorchVision is also required since we will be using it as our model zoo.\n\nA quick solution is to install via pip\n\n```bash\npip install torch\npip install torchvision\n```\nor please refer to official site\nhttps://pytorch.org/get-started/locally/\n\nPyTorch versions should be bac [...]
       ]
     },
     {
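As background for the workflow this tutorial covers, a TorchVision detection model is typically traced and imported into Relay roughly as follows; the model choice, input size, and input name are illustrative:

.. code-block:: python

    import torch
    import torchvision
    import tvm
    from tvm import relay

    # Tracing a detection model requires a wrapper that returns plain tensors
    # instead of the dict output TorchVision produces.
    class TraceWrapper(torch.nn.Module):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, inp):
            out = self.model(inp)
            return out[0]["boxes"], out[0]["scores"], out[0]["labels"]

    model = TraceWrapper(torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True))
    model.eval()
    inp = torch.rand(1, 3, 300, 300)
    with torch.no_grad():
        script_module = torch.jit.trace(model, inp)

    mod, params = relay.frontend.from_pytorch(script_module, [("input0", (1, 3, 300, 300))])
    # Detection outputs are dynamically shaped, so the Relay VM is used.
    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
        vm_exec = relay.vm.compile(mod, target="llvm", params=params)
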
diff --git a/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py b/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py
index e10a74c849..f2a4db6086 100644
--- a/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py
+++ b/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py
@@ -27,6 +27,7 @@ In this tutorial, we will see how TOPI can save us from writing boilerplate code
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py b/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py
index 96d2967947..4d0eea2d8d 100644
--- a/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py
+++ b/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py
@@ -23,13 +23,12 @@ Compile CoreML Models
 
 This article is an introductory tutorial to deploy CoreML models with Relay.
 
-For us to begin with, coremltools module is required to be installed.
-
-A quick solution is to install via pip
+To begin, we must install coremltools:
 
 .. code-block:: bash
 
-    pip install -U coremltools --user
+    %%shell
+    pip install coremltools
 
 or please refer to official site
 https://github.com/apple/coremltools
diff --git a/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb b/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb
index c3e2eb8c27..98fd44e5ff 100644
--- a/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb
+++ b/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py b/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py
index e5b452af66..33e5d98553 100644
--- a/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py
+++ b/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py
@@ -31,6 +31,7 @@ channel, batch.
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb b/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb
index c679065a73..a5a6e8902b 100644
--- a/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb
+++ b/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/3fde0fe8b31bf786dec2a01858372eae/deploy_model_on_nano.py b/docs/_downloads/3fde0fe8b31bf786dec2a01858372eae/deploy_model_on_nano.py
index 5e59dccf20..3d8a4a796f 100644
--- a/docs/_downloads/3fde0fe8b31bf786dec2a01858372eae/deploy_model_on_nano.py
+++ b/docs/_downloads/3fde0fe8b31bf786dec2a01858372eae/deploy_model_on_nano.py
@@ -26,6 +26,7 @@ it on Jetson Nano.
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb b/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb
index 9abc7a9294..1d14e1f220 100644
--- a/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb
+++ b/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -87,7 +87,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Lambda Functions\n\n  The second argument to the ``te.compute`` method is the function that\n  performs the computation. In this example, we're using an anonymous function,\n  also known as a ``lambda`` function, to define the computation, in this case\n  addition on the ``i``\\th element of ``A`` and ``B``.\n\n"
+        "<div class=\"alert alert-info\"><h4>Lambda Functions</h4><p>The second argument to the ``te.compute`` method is the function that\nperforms the computation. In this example, we're using an anonymous function,\nalso known as a ``lambda`` function, to define the computation, in this case\naddition on the ``i``\\th element of ``A`` and ``B``.</p></div>"
       ]
     },
     {
@@ -256,7 +256,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Code Specialization\n\n  As you may have noticed, the declarations of ``A``, ``B`` and ``C`` all\n  take the same shape argument, ``n``. TVM will take advantage of this to\n  pass only a single shape argument to the kernel, as you will find in the\n  printed device code. This is one form of specialization.\n\n  On the host side, TVM will automatically generate check code that checks\n  the constraints in the parameters. So if you pass arrays with different\n  sha [...]
+        "<div class=\"alert alert-info\"><h4>Code Specialization</h4><p>As you may have noticed, the declarations of ``A``, ``B`` and ``C`` all\ntake the same shape argument, ``n``. TVM will take advantage of this to\npass only a single shape argument to the kernel, as you will find in the\nprinted device code. This is one form of specialization.\n\nOn the host side, TVM will automatically generate check code that checks\nthe constraints in the parameters. So if you pass arrays with diff [...]
       ]
     },
     {
@@ -306,7 +306,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Module Storage Format\n\n  The CPU (host) module is directly saved as a shared library (.so). There\n  can be multiple customized formats of the device code. In our example, the\n  device code is stored in ptx, as well as a meta data json file. They can be\n  loaded and linked separately via import.\n\n"
+        "<div class=\"alert alert-info\"><h4>Module Storage Format</h4><p>The CPU (host) module is directly saved as a shared library (.so). There\ncan be multiple customized formats of the device code. In our example, the\ndevice code is stored in ptx, as well as a meta data json file. They can be\nloaded and linked separately via import.</p></div>"
       ]
     },
     {
@@ -349,7 +349,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Runtime API and Thread-Safety\n\n  The compiled modules of TVM do not depend on the TVM compiler. Instead,\n  they only depend on a minimum runtime library. The TVM runtime library\n  wraps the device drivers and provides thread-safe and device agnostic calls\n  into the compiled functions.\n\n  This means that you can call the compiled TVM functions from any thread, on\n  any GPUs, provided that you have compiled the code for that GPU.\n\n"
+        "<div class=\"alert alert-info\"><h4>Runtime API and Thread-Safety</h4><p>The compiled modules of TVM do not depend on the TVM compiler. Instead,\nthey only depend on a minimum runtime library. The TVM runtime library\nwraps the device drivers and provides thread-safe and device agnostic calls\ninto the compiled functions.\n\nThis means that you can call the compiled TVM functions from any thread, on\nany GPUs, provided that you have compiled the code for that GPU.</p></div>"
       ]
     },
     {
@@ -374,7 +374,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: TE Scheduling Primitives\n\n  TVM includes a number of different scheduling primitives:\n\n  - split: splits a specified axis into two axises by the defined factor.\n  - tile: tiles will split a computation across two axes by the defined factors.\n  - fuse: fuses two consecutive axises of one computation.\n  - reorder: can reorder the axises of a computation into a defined order.\n  - bind: can bind a computation to a specific thread, useful in GPU programming.\n [...]
+        "<div class=\"alert alert-info\"><h4>TE Scheduling Primitives</h4><p>TVM includes a number of different scheduling primitives:\n\n- split: splits a specified axis into two axises by the defined factor.\n- tile: tiles will split a computation across two axes by the defined factors.\n- fuse: fuses two consecutive axises of one computation.\n- reorder: can reorder the axises of a computation into a defined order.\n- bind: can bind a computation to a specific thread, useful in GPU pr [...]
       ]
     },
     {
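As a quick illustration of both the ``te.compute`` lambda form and the ``split`` primitive described in the admonitions above, here is a minimal self-contained sketch (names are illustrative):

.. code-block:: python

    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    # The lambda defines the computation: addition on the i-th elements of A and B.
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

    s = te.create_schedule(C.op)
    # split: break C's single axis into an outer and an inner axis of length 32.
    outer, inner = s[C].split(C.op.axis[0], factor=32)
    fadd = tvm.build(s, [A, B, C], "llvm", name="myadd")
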
diff --git a/docs/_downloads/4bbcfcce3c35b0b795a42c998ceb3770/from_mxnet.ipynb b/docs/_downloads/4bbcfcce3c35b0b795a42c998ceb3770/from_mxnet.ipynb
index 86081b5bc2..331ebb72ca 100644
--- a/docs/_downloads/4bbcfcce3c35b0b795a42c998ceb3770/from_mxnet.ipynb
+++ b/docs/_downloads/4bbcfcce3c35b0b795a42c998ceb3770/from_mxnet.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n\n# Compile MXNet Models\n**Author**: [Joshua Z. Zhang](https://zhreshold.github.io/),             [Kazutaka Morita](https://github.com/kazum)\n\nThis article is an introductory tutorial to deploy mxnet models with Relay.\n\nFor us to begin with, mxnet module is required to be installed.\n\nA quick solution is\n\n```bash\npip install mxnet --user\n```\nor please refer to official installation guide.\nhttps://mxnet.apache.org/versions/master/install/index.html\n"
+        "\n\n# Compile MXNet Models\n**Author**: [Joshua Z. Zhang](https://zhreshold.github.io/),             [Kazutaka Morita](https://github.com/kazum)\n\nThis article is an introductory tutorial to deploy mxnet models with Relay. To begin, we must install `mxnet`:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install mxnet"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "or please refer to official installation guide.\nhttps://mxnet.apache.org/versions/master/install/index.html\n"
       ]
     },
     {
diff --git a/docs/_downloads/4d3f955a709b320db0d42740fead8ac1/matrix_multiply_opt.ipynb b/docs/_downloads/4d3f955a709b320db0d42740fead8ac1/matrix_multiply_opt.ipynb
index d1b00d0aed..fec4acbd76 100644
--- a/docs/_downloads/4d3f955a709b320db0d42740fead8ac1/matrix_multiply_opt.ipynb
+++ b/docs/_downloads/4d3f955a709b320db0d42740fead8ac1/matrix_multiply_opt.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/4dc30a43f3a6aa3ed4bc3077ad35ff70/tune_network_arm.ipynb b/docs/_downloads/4dc30a43f3a6aa3ed4bc3077ad35ff70/tune_network_arm.ipynb
index 1fd407aed5..97b5d061d3 100644
--- a/docs/_downloads/4dc30a43f3a6aa3ed4bc3077ad35ff70/tune_network_arm.ipynb
+++ b/docs/_downloads/4dc30a43f3a6aa3ed4bc3077ad35ff70/tune_network_arm.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb b/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
index b079d0c82a..9edb809c86 100644
--- a/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
+++ b/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -40,7 +40,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Installing additional python dependencies\n\nIn order to run the demo, you will need some additional python packages.\nThese can be installed by using the requirements.txt file below:\n\n.. code-block:: text\n   :caption: requirements.txt\n   :name: requirements.txt\n\n    attrs==21.2.0\n    cloudpickle==2.0.0\n    decorator==5.1.0\n    ethos-u-vela==3.5.0\n    flatbuffers==1.12\n    lxml==4.6.3\n    nose==1.3.7\n    numpy==1.19.5\n    Pillow==8.3.2\n    psutil==5.8.0\n    sc [...]
+        "## Installing additional python dependencies\n\nIn order to run the demo, you will need some additional python packages.\nThese can be installed by using the requirements.txt file below:\n\nThese packages can be installed by running the following from the command line:\n\n```bash\npip install -r requirements.txt\n```\n"
       ]
     },
     {
@@ -96,21 +96,21 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Pre-processing the image\n\nThe following script will create 2 C header files in the src directory:\n\n* ``inputs.h`` - The image supplied as an argument to the script will be converted\n  to an array of integers for input to our MobileNet v1 model.\n* ``outputs.h`` - An integer array of zeroes will reserve 1001 integer values\n  for the output of inference.\n\n.. code-block:: python\n   :caption: convert_image.py\n   :name: convert_image.py\n\n    #!python ./convert_image.py [...]
+        "## Pre-processing the image\n\nThe following script will create 2 C header files in the src directory:\n\n* ``inputs.h`` - The image supplied as an argument to the script will be converted\n  to an array of integers for input to our MobileNet v1 model.\n* ``outputs.h`` - An integer array of zeroes will reserve 1001 integer values\n  for the output of inference.\n\nRun the script from the command line:\n\n```bash\npython convert_image.py ./kitten.jpg\n```\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Pre-processing the labels\n\nThe following script will create a ``labels.h`` header file in the src directory.\nThe labels.txt file that we downloaded previously will be turned\ninto an array of strings. This array will be used to display the label that\nour image has been classified as.\n\n.. code-block:: python\n   :caption: convert_labels.py\n   :name: convert_labels.py\n\n    #!python ./convert_labels.py\n    import os\n    import pathlib\n    import sys\n\n\n    def crea [...]
+        "## Pre-processing the labels\n\nThe following script will create a ``labels.h`` header file in the src directory.\nThe labels.txt file that we downloaded previously will be turned\ninto an array of strings. This array will be used to display the label that\nour image has been classified as.\n\nRun the script from the command line:\n\n```bash\npython convert_labels.py\n```\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Writing the demo application\n\nThe following C application will run a single inference of the MobileNet v1\nmodel on the image that we downloaded and converted to an array of integers\npreviously. Since the model was compiled with a target of \"ethos-u ...\",\noperators supported by the Ethos(TM)-U55 NPU will be offloaded for acceleration.\nOnce the application is built and run, our test image should be correctly\nclassied as a \"tabby\" and the result should be displayed on [...]
+        "## Writing the demo application\n\nThe following C application will run a single inference of the MobileNet v1\nmodel on the image that we downloaded and converted to an array of integers\npreviously. Since the model was compiled with a target of \"ethos-u ...\",\noperators supported by the Ethos(TM)-U55 NPU will be offloaded for acceleration.\nOnce the application is built and run, our test image should be correctly\nclassied as a \"tabby\" and the result should be displayed on [...]
       ]
     },
     {
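Since the full listing of ``convert_image.py`` no longer appears in the cell above, the following is an illustrative reconstruction of what such a script does, not the tutorial's exact code: resize the image to the model's input size and emit it as a C integer array, plus a zero-initialized output buffer.

.. code-block:: python

    # Illustrative sketch only; the real convert_image.py may differ in detail.
    import pathlib
    import sys

    import numpy as np
    from PIL import Image

    def create_header_file(name, tensor_name, tensor_data, output_path):
        """Write a numpy array into <output_path>/<name>.h as a C integer array."""
        header = pathlib.Path(output_path) / f"{name}.h"
        values = ", ".join(str(int(v)) for v in tensor_data.flatten())
        header.write_text(f"const int8_t {tensor_name}[{tensor_data.size}] = {{{values}}};\n")

    # MobileNet v1 expects a 224x224 RGB input; the shift to int8 is assumed here.
    image = Image.open(sys.argv[1]).resize((224, 224))
    image_data = (np.asarray(image).astype(np.int32) - 128).astype(np.int8)
    create_header_file("inputs", "input", image_data, "./src")
    # outputs.h reserves 1001 integer values for the inference result.
    create_header_file("outputs", "output", np.zeros(1001, dtype=np.int8), "./src")
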
diff --git a/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb b/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
index 132300b5f5..e99e9f04b6 100644
--- a/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
+++ b/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -22,7 +22,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "<div class=\"alert alert-info\"><h4>Note</h4><p>If you want to run this tutorial on the microTVM Reference VM, download the Jupyter\n    notebook using the link at the bottom of this page and save it into the TVM directory. Then:\n\n    #. Login to the reference VM with a modified ``vagrant ssh`` command:\n\n        ``$ vagrant ssh -- -L8888:localhost:8888``\n\n    #. Install jupyter:  ``pip install jupyterlab``\n    #. ``cd`` to the TVM directory.\n    #. Install tflite: poetry [...]
+        "## Install microTVM Python dependencies\n\nTVM does not include a package for Python serial communication, so\nwe must install one before using microTVM. We will also need TFLite\nto load models.\n"
       ]
     },
     {
@@ -33,7 +33,72 @@
       },
       "outputs": [],
       "source": [
-        "import os\nimport json\nimport tarfile\nimport pathlib\nimport tempfile\nimport numpy as np\n\nimport tvm\nfrom tvm import relay\nimport tvm.contrib.utils\nfrom tvm.contrib.download import download_testdata\n\nuse_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))\nmodel_url = \"https://people.linaro.org/~tom.gall/sine_model.tflite\"\nmodel_file = \"sine_model.tflite\"\nmodel_path = download_testdata(model_url, model_file, module=\"data\")\n\ntflite_model_buf = open(model_path, [...]
+        "%%shell\npip install pyserial==3.5 tflite==2.1"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "import os\n\n# By default, this tutorial runs on x86 CPU using TVM's C runtime. If you would like\n# to run on real Zephyr hardware, you must export the `TVM_MICRO_USE_HW` environment\n# variable. Otherwise (if you are using the C runtime), you can skip installing\n# Zephyr and CMSIS-NN. It takes ~20 minutes to install both of them.\nuse_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install Zephyr\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\n# Install west and ninja\npython3 -m pip install west\napt-get install -y ninja-build\n\n# Install ZephyrProject\nZEPHYR_PROJECT_PATH=\"/content/zephyrproject\"\nexport ZEPHYR_BASE=${ZEPHYR_PROJECT_PATH}/zephyr\nwest init ${ZEPHYR_PROJECT_PATH}\ncd ${ZEPHYR_BASE}\ngit checkout v2.7-branch\ncd ..\nwest update\nwest zephyr-export\nchmod -R o+w ${ZEPHYR_PROJECT_PATH}\n\n# Install Zephyr SDK\nZEPHYR_SDK_VERSION=0.13.2\nZEPHYR_SDK_FILE=\"/content/zephyr-sdk-linux-setup.run\" [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install CMSIS-NN\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\nCMSIS_SHA=\"51263182d16c92649a48144ba56c0945f9fce60e\"\nCMSIS_URL=\"http://github.com/ARM-software/CMSIS_5/archive/${CMSIS_SHA}.tar.gz\"\nexport CMSIS_PATH=/content/cmsis\nDOWNLOAD_PATH=\"/content/${CMSIS_SHA}.tar.gz\"\nmkdir ${CMSIS_PATH}\nwget ${CMSIS_URL} -O \"${DOWNLOAD_PATH}\"\ntar -xf \"${DOWNLOAD_PATH}\" -C ${CMSIS_PATH} --strip-components=1\nrm ${DOWNLOAD_PATH}"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Import Python dependencies\n\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "import json\nimport tarfile\nimport pathlib\nimport tempfile\nimport numpy as np\n\nimport tvm\nfrom tvm import relay\nimport tvm.contrib.utils\nfrom tvm.contrib.download import download_testdata\n\nmodel_url = \"https://people.linaro.org/~tom.gall/sine_model.tflite\"\nmodel_file = \"sine_model.tflite\"\nmodel_path = download_testdata(model_url, model_file, module=\"data\")\n\ntflite_model_buf = open(model_path, \"rb\").read()"
       ]
     },
     {
@@ -105,7 +170,7 @@
       },
       "outputs": [],
       "source": [
-        "RUNTIME = tvm.relay.backend.Runtime(\"crt\", {\"system-lib\": True})\nTARGET = tvm.target.target.micro(\"host\")\n\n#\n# Compiling for physical hardware\n#  When running on physical hardware, choose a TARGET and a BOARD that describe the hardware. The\n#  STM32F746 Nucleo target and board is chosen in the example below. Another option would be to\n#  choose the STM32F746 Discovery board instead. Since that board has the same MCU as the Nucleo\n#  board but a couple of wirings an [...]
+        "RUNTIME = tvm.relay.backend.Runtime(\"crt\", {\"system-lib\": True})\nTARGET = tvm.target.target.micro(\"host\")\n\n#\n# Compiling for physical hardware\n#  When running on physical hardware, choose a TARGET and a BOARD that describe the hardware. The\n#  STM32F746 Nucleo target and board is chosen in the example below. Another option would be to\n#  choose the STM32F746 Discovery board instead. Since that board has the same MCU as the Nucleo\n#  board but a couple of wirings an [...]
       ]
     },
     {
@@ -123,7 +188,7 @@
       },
       "outputs": [],
       "source": [
-        "with tvm.transform.PassContext(\n    opt_level=3, config={\"tir.disable_vectorize\": True}, disabled_pass=[\"AlterOpLayout\"]\n):\n    module = relay.build(mod, target=TARGET, runtime=RUNTIME, params=params)\n\n\n# Inspecting the compilation output\n# ---------------------------------\n#\n# The compilation process has produced some C code implementing the operators in this graph. We\n# can inspect it by printing the CSourceModule contents (for the purposes of this tutorial, let' [...]
+        "with tvm.transform.PassContext(\n    opt_level=3, config={\"tir.disable_vectorize\": True}, disabled_pass=[\"AlterOpLayout\"]\n):\n    module = relay.build(mod, target=TARGET, runtime=RUNTIME, params=params)\n\n\n# Inspecting the compilation output\n# ---------------------------------\n#\n# The compilation process has produced some C code implementing the operators in this graph. We\n# can inspect it by printing the CSourceModule contents (for the purposes of this tutorial, let' [...]
       ]
     },
     {
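For context, the ``tflite_model_buf`` downloaded in the import cell above is typically parsed and imported into Relay along these lines; the input name and shape follow the sine model used in this tutorial and should be treated as assumptions:

.. code-block:: python

    import tflite
    from tvm import relay

    # `tflite_model_buf` comes from the notebook cell above.
    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)

    # Assumed input signature of the sine model: a single float32 value.
    input_tensor = "dense_4_input"
    mod, params = relay.frontend.from_tflite(
        tflite_model,
        shape_dict={input_tensor: (1, 1)},
        dtype_dict={input_tensor: "float32"},
    )
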
diff --git a/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py b/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py
index a62fa39793..dc75a3fb94 100644
--- a/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py
+++ b/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py
@@ -30,6 +30,7 @@ TensorIR is a domain specific language for deep learning programs serving two br
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/5e4e499c097b16a90c517e630502253a/tune_network_mali.ipynb b/docs/_downloads/5e4e499c097b16a90c517e630502253a/tune_network_mali.ipynb
index c9f727bb3c..4ec99f43d3 100644
--- a/docs/_downloads/5e4e499c097b16a90c517e630502253a/tune_network_mali.ipynb
+++ b/docs/_downloads/5e4e499c097b16a90c517e630502253a/tune_network_mali.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/5f1f7bd7d90710fd404f7bcdc4965622/tune_conv2d_layer_cuda.ipynb b/docs/_downloads/5f1f7bd7d90710fd404f7bcdc4965622/tune_conv2d_layer_cuda.ipynb
index 23797ebfd7..52977a93c6 100644
--- a/docs/_downloads/5f1f7bd7d90710fd404f7bcdc4965622/tune_conv2d_layer_cuda.ipynb
+++ b/docs/_downloads/5f1f7bd7d90710fd404f7bcdc4965622/tune_conv2d_layer_cuda.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/63f9e50204143ea3c2d3593c72439b3d/intro_topi.ipynb b/docs/_downloads/63f9e50204143ea3c2d3593c72439b3d/intro_topi.ipynb
index be30c50eef..ce4519e8c0 100644
--- a/docs/_downloads/63f9e50204143ea3c2d3593c72439b3d/intro_topi.ipynb
+++ b/docs/_downloads/63f9e50204143ea3c2d3593c72439b3d/intro_topi.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/66e1a42229aae7ed49ac268f520e6727/deploy_detection.ipynb b/docs/_downloads/66e1a42229aae7ed49ac268f520e6727/deploy_detection.ipynb
index 0aaa23f5ac..0adfe83025 100644
--- a/docs/_downloads/66e1a42229aae7ed49ac268f520e6727/deploy_detection.ipynb
+++ b/docs/_downloads/66e1a42229aae7ed49ac268f520e6727/deploy_detection.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py b/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py
index 4560cf881e..a73b97525f 100644
--- a/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py
+++ b/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py
@@ -49,6 +49,7 @@ __name__ == "__main__":` block.
 # Now return to python code. Import packages.
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb b/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
index 636c4b003e..161cbf6c98 100644
--- a/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
+++ b/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/6e0673ce1f08636c34d0b9a73ea114f7/uma.ipynb b/docs/_downloads/6e0673ce1f08636c34d0b9a73ea114f7/uma.ipynb
index e8db02fc30..e98f2e035c 100644
--- a/docs/_downloads/6e0673ce1f08636c34d0b9a73ea114f7/uma.ipynb
+++ b/docs/_downloads/6e0673ce1f08636c34d0b9a73ea114f7/uma.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/6e511f5a8ddbf12f2fca2dfadc0cc4a9/micro_tvmc.ipynb b/docs/_downloads/6e511f5a8ddbf12f2fca2dfadc0cc4a9/micro_tvmc.ipynb
index 6c23727fd4..f1519a2f1a 100644
--- a/docs/_downloads/6e511f5a8ddbf12f2fca2dfadc0cc4a9/micro_tvmc.ipynb
+++ b/docs/_downloads/6e511f5a8ddbf12f2fca2dfadc0cc4a9/micro_tvmc.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb b/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb
index dfb6cccea6..65ff632d40 100644
--- a/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb
+++ b/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb b/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
index d5c2590b17..fe57e3b5a7 100644
--- a/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
+++ b/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py b/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py
index 8db20b9b9b..5734f064f0 100644
--- a/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py
+++ b/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py
@@ -52,6 +52,7 @@ convolution has a large batch. We strongly recommend covering the :ref:`opt-conv
 # NHWCnc memory layout.The following code defines the convolution algorithm in TVM.
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb b/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
index 30326660eb..40e64b99d5 100644
--- a/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
+++ b/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py b/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py
index c12a9e7e15..8397efa63b 100644
--- a/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py
+++ b/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py
@@ -27,8 +27,9 @@ Please install CFFI and CV2 before executing this script
 
 .. code-block:: bash
 
-  pip install cffi
-  pip install opencv-python
+  %%shell
+  pip install cffi opencv-python
+
 """
 
 # sphinx_gallery_start_ignore
diff --git a/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py b/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py
index 0d8d0f2867..ffde042e2b 100644
--- a/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py
+++ b/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py
@@ -27,8 +27,8 @@ A quick solution is to install via pip
 
 .. code-block:: bash
 
-    pip install torch==1.7.0
-    pip install torchvision==0.8.1
+    pip install torch
+    pip install torchvision
 
 or please refer to official site
 https://pytorch.org/get-started/locally/
diff --git a/docs/_downloads/779f52a44f2b8ab22dc21eee0c27fd4d/from_onnx.ipynb b/docs/_downloads/779f52a44f2b8ab22dc21eee0c27fd4d/from_onnx.ipynb
index 8d879842f5..0c3b3a29d2 100644
--- a/docs/_downloads/779f52a44f2b8ab22dc21eee0c27fd4d/from_onnx.ipynb
+++ b/docs/_downloads/779f52a44f2b8ab22dc21eee0c27fd4d/from_onnx.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile ONNX Models\n**Author**: [Joshua Z. Zhang](https://zhreshold.github.io/)\n\nThis article is an introductory tutorial to deploy ONNX models with Relay.\n\nFor us to begin with, ONNX package must be installed.\n\nA quick solution is to install protobuf compiler, and\n\n```bash\npip install --user onnx onnxoptimizer\n```\nor please refer to official site.\nhttps://github.com/onnx/onnx\n"
+        "\n# Compile ONNX Models\n**Author**: [Joshua Z. Zhang](https://zhreshold.github.io/)\n\nThis article is an introductory tutorial to deploy ONNX models with Relay.\n\nTo begin, install the ONNX package:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install onnx onnxoptimizer"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Alternatively, you can refer to official site:\nhttps://github.com/onnx/onnx\n"
       ]
     },
     {
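Once ONNX is installed, importing a model into Relay follows this general pattern; the file path, input name, and shape below are placeholders:

.. code-block:: python

    import onnx
    import tvm
    from tvm import relay

    onnx_model = onnx.load("model.onnx")  # placeholder path
    shape_dict = {"input.1": (1, 3, 224, 224)}  # placeholder input name/shape
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)
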
diff --git a/docs/_downloads/7c392f39b90d93406ef30c6185c5686c/deploy_model_on_rasp.ipynb b/docs/_downloads/7c392f39b90d93406ef30c6185c5686c/deploy_model_on_rasp.ipynb
index 3266a665f3..14f9a5c11f 100644
--- a/docs/_downloads/7c392f39b90d93406ef30c6185c5686c/deploy_model_on_rasp.ipynb
+++ b/docs/_downloads/7c392f39b90d93406ef30c6185c5686c/deploy_model_on_rasp.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/7ef06253b3d2676eb50e20a5f81ef8f9/micro_reference_vm.ipynb b/docs/_downloads/7ef06253b3d2676eb50e20a5f81ef8f9/micro_reference_vm.ipynb
index 753fa94c90..188456cb0c 100644
--- a/docs/_downloads/7ef06253b3d2676eb50e20a5f81ef8f9/micro_reference_vm.ipynb
+++ b/docs/_downloads/7ef06253b3d2676eb50e20a5f81ef8f9/micro_reference_vm.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/7ef14586a3b62fe120d97d5fedf72879/use_pass_infra.ipynb b/docs/_downloads/7ef14586a3b62fe120d97d5fedf72879/use_pass_infra.ipynb
index e841e9ef12..4e3d70da56 100644
--- a/docs/_downloads/7ef14586a3b62fe120d97d5fedf72879/use_pass_infra.ipynb
+++ b/docs/_downloads/7ef14586a3b62fe120d97d5fedf72879/use_pass_infra.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py b/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py
index 9a32397815..b85b9e669a 100644
--- a/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py
+++ b/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py
@@ -21,6 +21,11 @@ This article is an introductory tutorial to deploy tensorflow models with TVM.
 
 For us to begin with, tensorflow python module is required to be installed.
 
+.. code-block:: bash
+
+    %%shell
+    pip install tensorflow
+
 Please refer to https://www.tensorflow.org/install
 """
 
diff --git a/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb b/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
index af9ebca2d7..5272ced86a 100644
--- a/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
+++ b/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Building a Graph Convolutional Network\n**Author**: [Yulun Yao](https://yulunyao.io/),             [Chien-Yu Lin](https://homes.cs.washington.edu/~cyulin/)\n\nThis article is an introductory tutorial to build a Graph Convolutional Network (GCN) with Relay.\nIn this tutorial, we will run our GCN on Cora dataset to demonstrate.\nCora dataset is a common benchmark for Graph Neural Networks (GNN) and frameworks that support GNN training and inference.\nWe directly load the datas [...]
+        "\n# Building a Graph Convolutional Network\n**Author**: [Yulun Yao](https://yulunyao.io/),             [Chien-Yu Lin](https://homes.cs.washington.edu/~cyulin/)\n\nThis article is an introductory tutorial to build a Graph Convolutional Network (GCN) with Relay.\nIn this tutorial, we will run our GCN on Cora dataset to demonstrate.\nCora dataset is a common benchmark for Graph Neural Networks (GNN) and frameworks that support GNN training and inference.\nWe directly load the datas [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install torch==1.9.0\npip install dgl==v0.7.2 -f https://data.dgl.ai/wheels/repo.html"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Please refer to DGL doc for installation at\nhttps://docs.dgl.ai/install/index.html.\n\nPlease refer to PyTorch guide for PyTorch installation at\nhttps://pytorch.org/get-started/locally/.\n"
       ]
     },
     {
diff --git a/docs/_downloads/83b9961c758069912464db3443fffc06/vta_get_started.ipynb b/docs/_downloads/83b9961c758069912464db3443fffc06/vta_get_started.ipynb
index 00d7ba2ba4..4ed64a4821 100644
--- a/docs/_downloads/83b9961c758069912464db3443fffc06/vta_get_started.ipynb
+++ b/docs/_downloads/83b9961c758069912464db3443fffc06/vta_get_started.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/83e3b018e8bac8d31bb331d200a33a04/from_tensorflow.ipynb b/docs/_downloads/83e3b018e8bac8d31bb331d200a33a04/from_tensorflow.ipynb
index 38e3fc9015..65a02aa049 100644
--- a/docs/_downloads/83e3b018e8bac8d31bb331d200a33a04/from_tensorflow.ipynb
+++ b/docs/_downloads/83e3b018e8bac8d31bb331d200a33a04/from_tensorflow.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile Tensorflow Models\nThis article is an introductory tutorial to deploy tensorflow models with TVM.\n\nFor us to begin with, tensorflow python module is required to be installed.\n\nPlease refer to https://www.tensorflow.org/install\n"
+        "\n# Compile Tensorflow Models\nThis article is an introductory tutorial to deploy tensorflow models with TVM.\n\nFor us to begin with, tensorflow python module is required to be installed.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install tensorflow"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Please refer to https://www.tensorflow.org/install\n"
       ]
     },
     {
diff --git a/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb b/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb
index 633027b3a1..6c5cc796dd 100644
--- a/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb
+++ b/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb b/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
index d1f99f4d93..9f55992494 100644
--- a/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
+++ b/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py b/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py
index d21673acd9..d523d5b995 100644
--- a/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py
+++ b/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py
@@ -26,6 +26,7 @@ from __future__ import absolute_import, print_function
 
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb b/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb
index ef2655599e..fe01ab9d3c 100644
--- a/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb
+++ b/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/910e6ecee4ecac8d8ca0baeb6d00689d/tune_relay_x86.ipynb b/docs/_downloads/910e6ecee4ecac8d8ca0baeb6d00689d/tune_relay_x86.ipynb
index 1c3e9602d3..b4a001add7 100644
--- a/docs/_downloads/910e6ecee4ecac8d8ca0baeb6d00689d/tune_relay_x86.ipynb
+++ b/docs/_downloads/910e6ecee4ecac8d8ca0baeb6d00689d/tune_relay_x86.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/95395e118195f25266654dd8fbf487d4/deploy_classification.ipynb b/docs/_downloads/95395e118195f25266654dd8fbf487d4/deploy_classification.ipynb
index beab65729d..6b5b7e11e0 100644
--- a/docs/_downloads/95395e118195f25266654dd8fbf487d4/deploy_classification.ipynb
+++ b/docs/_downloads/95395e118195f25266654dd8fbf487d4/deploy_classification.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py b/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py
index 13bf4efac1..3dd4cab6c9 100644
--- a/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py
+++ b/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py
@@ -27,13 +27,37 @@ Autotuning with microTVM
 This tutorial explains how to autotune a model using the C runtime.
 """
 
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
+#
+
 # sphinx_gallery_start_ignore
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
 # sphinx_gallery_end_ignore
 
+# You can skip the following two sections (installing Zephyr and CMSIS-NN) if the following flag is False.
+# Installing Zephyr takes ~20 min.
 import os
+
+use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
+
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_zephyr.rst
+#
+
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_cmsis.rst
+#
+
+######################################################################
+# Import Python dependencies
+# -------------------------------
+#
 import json
 import numpy as np
 import pathlib
@@ -41,8 +65,6 @@ import pathlib
 import tvm
 from tvm.relay.backend import Runtime
 
-use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
-
 ####################
 # Defining the model
 ####################
diff --git a/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb b/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb
index fdae47d1c3..44bc836a18 100644
--- a/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb
+++ b/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb b/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb
index 97f59a1638..460d8a2827 100644
--- a/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb
+++ b/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/a269cb38341b190be980a0bd3ea8a625/deploy_quantized.ipynb b/docs/_downloads/a269cb38341b190be980a0bd3ea8a625/deploy_quantized.ipynb
index 98b74cfafd..1dde4bd918 100644
--- a/docs/_downloads/a269cb38341b190be980a0bd3ea8a625/deploy_quantized.ipynb
+++ b/docs/_downloads/a269cb38341b190be980a0bd3ea8a625/deploy_quantized.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/a608d8b69371e9bc149dd89f6db2c38e/from_paddle.ipynb b/docs/_downloads/a608d8b69371e9bc149dd89f6db2c38e/from_paddle.ipynb
index aba2daed4d..1f6a414ad7 100644
--- a/docs/_downloads/a608d8b69371e9bc149dd89f6db2c38e/from_paddle.ipynb
+++ b/docs/_downloads/a608d8b69371e9bc149dd89f6db2c38e/from_paddle.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile PaddlePaddle Models\n**Author**: [Ziyuan Ma](https://github.com/ZiyuanMa/)\n\nThis article is an introductory tutorial to deploy PaddlePaddle models with Relay.\nFor us to begin with, PaddlePaddle>=2.1.3 is required to be installed.\nA quick solution is\n\n```bash\npip install paddlepaddle -i https://mirror.baidu.com/pypi/simple\n```\nor please refer to official site.\nhttps://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html\n"
+        "\n# Compile PaddlePaddle Models\n**Author**: [Ziyuan Ma](https://github.com/ZiyuanMa/)\n\nThis article is an introductory tutorial to deploy PaddlePaddle models with Relay.\nTo begin, we'll install PaddlePaddle>=2.1.3:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install paddlepaddle -i https://mirror.baidu.com/pypi/simple"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "For more details, refer to the official install instructions at:\nhttps://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html\n"
       ]
     },
     {
diff --git a/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py b/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py
index d1b78f11d5..a248346c29 100644
--- a/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py
+++ b/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py
@@ -25,9 +25,8 @@ To get started, TFLite package needs to be installed as prerequisite.
 
 .. code-block:: bash
 
-    # install tflite
-    pip install tflite==2.1.0 --user
-
+    %%shell
+    pip install tflite==2.1.0
 
 or you could generate TFLite package yourself. The steps are the following:
 
diff --git a/docs/_downloads/a7aff5918e1b86809a5bd1da8bef7229/tedd.ipynb b/docs/_downloads/a7aff5918e1b86809a5bd1da8bef7229/tedd.ipynb
index d09e028a29..11ade6c0ad 100644
--- a/docs/_downloads/a7aff5918e1b86809a5bd1da8bef7229/tedd.ipynb
+++ b/docs/_downloads/a7aff5918e1b86809a5bd1da8bef7229/tedd.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb b/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb
index f9b5d5553d..af36cf8c0b 100644
--- a/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb
+++ b/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -22,7 +22,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "<div class=\"alert alert-info\"><h4>Note</h4><p>This tutorial is best viewed as a Jupyter Notebook. You can download and run it locally\n  using the link at the bottom of this page, or open it online for free using Google Colab.\n  Click the icon below to open in Google Colab.</p></div>\n\n<img src=\"https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.png\" align=\"center\" target=\"https://colab.research.google.com/github/apache/tvm-site/blob/ [...]
+        "## Motivation\nWhen building IOT devices, we often want them to **see and understand** the world around them.\nThis can take many forms, but often times a device will want to know if a certain **kind of\nobject** is in its field of vision.\n\nFor example, a security camera might look for **people**, so it can decide whether to save a video\nto memory. A traffic light might look for **cars**, so it can judge which lights should change\nfirst. Or a forest camera might look for a * [...]
       ]
     },
     {
@@ -33,7 +33,7 @@
       },
       "outputs": [],
       "source": [
-        "%%bash\npip install -q tensorflow tflite\npip install -q tlcpack-nightly -f https://tlcpack.ai/wheels\napt-get -qq install imagemagick curl\n\n# Install Arduino CLI and library for Nano 33 BLE\ncurl -fsSL https://raw.githubusercontent.com/arduino/arduino-cli/master/install.sh | sh\n/content/bin/arduino-cli core update-index\n/content/bin/arduino-cli core install arduino:mbed_nano"
+        "%%shell\npip install -q tensorflow tflite\npip install -q tlcpack-nightly -f https://tlcpack.ai/wheels\napt-get -qq install imagemagick curl\n\n# Install Arduino CLI and library for Nano 33 BLE\ncurl -fsSL https://raw.githubusercontent.com/arduino/arduino-cli/master/install.sh | sh\n/content/bin/arduino-cli core update-index\n/content/bin/arduino-cli core install arduino:mbed_nano"
       ]
     },
     {
@@ -267,7 +267,7 @@
       },
       "outputs": [],
       "source": [
-        "%%bash\nmkdir -p ~/tests\ncurl \"https://i.imgur.com/JBbEhxN.png\" -o ~/tests/car_224.png\nconvert ~/tests/car_224.png -resize 64 ~/tests/car_64.png\nstream ~/tests/car_64.png ~/tests/car.raw\nbin2c -c -st ~/tests/car.raw --name CAR_IMAGE > ~/models/project/car.c\n\ncurl \"https://i.imgur.com/wkh7Dx2.png\" -o ~/tests/catan_224.png\nconvert ~/tests/catan_224.png -resize 64 ~/tests/catan_64.png\nstream ~/tests/catan_64.png ~/tests/catan.raw\nbin2c -c -st ~/tests/catan.raw --name C [...]
+        "%%shell\nmkdir -p ~/tests\ncurl \"https://i.imgur.com/JBbEhxN.png\" -o ~/tests/car_224.png\nconvert ~/tests/car_224.png -resize 64 ~/tests/car_64.png\nstream ~/tests/car_64.png ~/tests/car.raw\nbin2c -c -st ~/tests/car.raw --name CAR_IMAGE > ~/models/project/car.c\n\ncurl \"https://i.imgur.com/wkh7Dx2.png\" -o ~/tests/catan_224.png\nconvert ~/tests/catan_224.png -resize 64 ~/tests/catan_64.png\nstream ~/tests/catan_64.png ~/tests/catan.raw\nbin2c -c -st ~/tests/catan.raw --name  [...]
       ]
     },
     {
diff --git a/docs/_downloads/a883b8474634054b6a79c17a288aa8ed/from_coreml.ipynb b/docs/_downloads/a883b8474634054b6a79c17a288aa8ed/from_coreml.ipynb
index bb7fdd5755..c69c6eb14e 100644
--- a/docs/_downloads/a883b8474634054b6a79c17a288aa8ed/from_coreml.ipynb
+++ b/docs/_downloads/a883b8474634054b6a79c17a288aa8ed/from_coreml.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile CoreML Models\n**Author**: [Joshua Z. Zhang](https://zhreshold.github.io/),             [Kazutaka Morita](https://github.com/kazum),             [Zhao Wu](https://github.com/FrozenGene)\n\nThis article is an introductory tutorial to deploy CoreML models with Relay.\n\nFor us to begin with, coremltools module is required to be installed.\n\nA quick solution is to install via pip\n\n```bash\npip install -U coremltools --user\n```\nor please refer to official site\nhttp [...]
+        "\n# Compile CoreML Models\n**Author**: [Joshua Z. Zhang](https://zhreshold.github.io/),             [Kazutaka Morita](https://github.com/kazum),             [Zhao Wu](https://github.com/FrozenGene)\n\nThis article is an introductory tutorial to deploy CoreML models with Relay.\n\nTo begin, we must install coremltools:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install coremltools"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "or please refer to official site\nhttps://github.com/apple/coremltools\n"
       ]
     },
     {
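With coremltools installed, the import path into Relay is ``relay.frontend.from_coreml``; a minimal sketch, with a placeholder model path and input name:

.. code-block:: python

    import coremltools
    import tvm
    from tvm import relay

    # "mobilenet.mlmodel" is a placeholder path to a saved CoreML model.
    mlmodel = coremltools.models.MLModel("mobilenet.mlmodel")

    # The input name and shape are model-specific.
    mod, params = relay.frontend.from_coreml(mlmodel, {"image": (1, 3, 224, 224)})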
diff --git a/docs/_downloads/ad2a7f55d615d188ad664d56696815a6/tune_network_x86.ipynb b/docs/_downloads/ad2a7f55d615d188ad664d56696815a6/tune_network_x86.ipynb
index 4edbb9b5a0..8f1c80018b 100644
--- a/docs/_downloads/ad2a7f55d615d188ad664d56696815a6/tune_network_x86.ipynb
+++ b/docs/_downloads/ad2a7f55d615d188ad664d56696815a6/tune_network_x86.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/af264436d049e3cd84803b67b6620b63/tune_network_cuda.ipynb b/docs/_downloads/af264436d049e3cd84803b67b6620b63/tune_network_cuda.ipynb
index a4f4910f98..62ebdff2f4 100644
--- a/docs/_downloads/af264436d049e3cd84803b67b6620b63/tune_network_cuda.ipynb
+++ b/docs/_downloads/af264436d049e3cd84803b67b6620b63/tune_network_cuda.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb b/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
index 091930b2e1..d5c2ba50d7 100644
--- a/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
+++ b/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
@@ -8,14 +8,14 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Bring Your Own Datatypes to TVM\n**Authors**: [Gus Smith](https://github.com/gussmith23), [Andrew Liu](https://github.com/hypercubestart)\n\nIn this tutorial, we will show you how to utilize the Bring Your Own Datatypes framework to use your own custom datatypes in TVM.\nNote that the Bring Your Own Datatypes framework currently only handles **software emulated versions of datatypes**.\nThe framework does not support compiling for custom accelerator datatypes out-of-the-box. [...]
+        "\n# Bring Your Own Datatypes to TVM\n**Authors**: [Gus Smith](https://github.com/gussmith23), [Andrew Liu](https://github.com/hypercubestart)\n\nIn this tutorial, we will show you how to utilize the Bring Your Own Datatypes framework to use your own custom datatypes in TVM.\nNote that the Bring Your Own Datatypes framework currently only handles **software emulated versions of datatypes**.\nThe framework does not support compiling for custom accelerator datatypes out-of-the-box. [...]
       ]
     },
     {
diff --git a/docs/_downloads/b1b0cbd807166348a0eabbad6bfbbdaf/tune_relay_vta.ipynb b/docs/_downloads/b1b0cbd807166348a0eabbad6bfbbdaf/tune_relay_vta.ipynb
index 96be3db576..494c3b350b 100644
--- a/docs/_downloads/b1b0cbd807166348a0eabbad6bfbbdaf/tune_relay_vta.ipynb
+++ b/docs/_downloads/b1b0cbd807166348a0eabbad6bfbbdaf/tune_relay_vta.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/b3f997c945cc7de3e03a1e0c4c73fabd/convolution_opt.ipynb b/docs/_downloads/b3f997c945cc7de3e03a1e0c4c73fabd/convolution_opt.ipynb
index e55ccf8f4b..a091531e6a 100644
--- a/docs/_downloads/b3f997c945cc7de3e03a1e0c4c73fabd/convolution_opt.ipynb
+++ b/docs/_downloads/b3f997c945cc7de3e03a1e0c4c73fabd/convolution_opt.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py b/docs/_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py
index 44e0dd5cb7..9b8a9a68dd 100644
--- a/docs/_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py
+++ b/docs/_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py
@@ -27,17 +27,6 @@ deployed to Arduino using TVM.
 """
 
 ######################################################################
-# .. note::
-#
-#   This tutorial is best viewed as a Jupyter Notebook. You can download and run it locally
-#   using the link at the bottom of this page, or open it online for free using Google Colab.
-#   Click the icon below to open in Google Colab.
-#
-# .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.png
-#      :align: center
-#      :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb
-#      :width: 300px
-#
 # Motivation
 # ----------
 # When building IOT devices, we often want them to **see and understand** the world around them.
@@ -71,7 +60,7 @@ deployed to Arduino using TVM.
 #
 #     .. code-block:: bash
 #
-#       %%bash
+#       %%shell
 #       pip install -q tensorflow tflite
 #       pip install -q tlcpack-nightly -f https://tlcpack.ai/wheels
 #       apt-get -qq install imagemagick curl
@@ -515,7 +504,7 @@ arduino_project = tvm.micro.generate_project(
 #
 #     .. code-block:: bash
 #
-#       %%bash
+#       %%shell
 #       mkdir -p ~/tests
 #       curl "https://i.imgur.com/JBbEhxN.png" -o ~/tests/car_224.png
 #       convert ~/tests/car_224.png -resize 64 ~/tests/car_64.png
diff --git a/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb b/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb
index e9c9579ee9..015bba396f 100644
--- a/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb
+++ b/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/b954238c1884e83b45d2ae543d824f03/using_relay_viz.ipynb b/docs/_downloads/b954238c1884e83b45d2ae543d824f03/using_relay_viz.ipynb
index 8d296f5793..4244b3e7d8 100644
--- a/docs/_downloads/b954238c1884e83b45d2ae543d824f03/using_relay_viz.ipynb
+++ b/docs/_downloads/b954238c1884e83b45d2ae543d824f03/using_relay_viz.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Use Relay Visualizer to Visualize Relay\n**Author**: [Chi-Wei Wang](https://github.com/chiwwang)\n\nRelay IR module can contain lots of operations. Although an individual\noperation is usually easy to understand, putting them together can cause\na complicated, hard-to-read graph. Things can get even worse with optimization-passes\ncoming into play.\n\nThis utility visualizes an IR module as nodes and edges. It defines a set of interfaces including\nparser, plotter(renderer), [...]
+        "\n# Use Relay Visualizer to Visualize Relay\n**Author**: [Chi-Wei Wang](https://github.com/chiwwang)\n\nRelay IR module can contain lots of operations. Although an individual\noperation is usually easy to understand, putting them together can cause\na complicated, hard-to-read graph. Things can get even worse with optimization-passes\ncoming into play.\n\nThis utility visualizes an IR module as nodes and edges. It defines a set of interfaces including\nparser, plotter(renderer), [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install graphviz"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "For more details, please refer to :py:mod:`tvm.contrib.relay_viz`.\n"
       ]
     },
     {
diff --git a/docs/_downloads/b9e7311d8c56eb6e6aca08f0be35ff03/deploy_model_on_adreno.ipynb b/docs/_downloads/b9e7311d8c56eb6e6aca08f0be35ff03/deploy_model_on_adreno.ipynb
index eb0a669a03..50fde2e6a2 100644
--- a/docs/_downloads/b9e7311d8c56eb6e6aca08f0be35ff03/deploy_model_on_adreno.ipynb
+++ b/docs/_downloads/b9e7311d8c56eb6e6aca08f0be35ff03/deploy_model_on_adreno.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n\n# Deploy the Pretrained Model on Adreno\n**Author**: Daniil Barinov\n\nThis article is a step-by-step tutorial to deploy pretrained Pytorch ResNet-18 model on Adreno (on different precisions).\n\nFor us to begin with, PyTorch must be installed.\nTorchVision is also required since we will be using it as our model zoo.\n\nA quick solution is to install it via pip:\n\n```bash\npip install torch\npip install torchvision\n```\nBesides that, you should have TVM builded for Android [...]
+        "\n\n# Deploy the Pretrained Model on Adreno\n**Author**: Daniil Barinov\n\nThis article is a step-by-step tutorial to deploy pretrained Pytorch ResNet-18 model on Adreno (on different precisions).\n\nFor us to begin with, PyTorch must be installed.\nTorchVision is also required since we will be using it as our model zoo.\n\nA quick solution is to install it via pip:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install torch\npip install torchvision"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Besides that, you should have TVM builded for Android.\nSee the following instructions on how to build it.\n\n[Deploy to Adreno GPU](https://tvm.apache.org/docs/how_to/deploy/adreno.html)\n\nAfter the build section there should be two files in *build* directory \u00ablibtvm_runtime.so\u00bb and \u00abtvm_rpc\u00bb.\nLet's push them to the device and run TVM RPC Server.\n"
       ]
     },
     {
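Once ``tvm_rpc`` is running on the device and registered with an RPC tracker, requesting a session from the host looks roughly like this sketch. The tracker address and the device key ``"android"`` are assumptions that depend on how the tracker and server were started.

.. code-block:: python

    import tvm.rpc

    # Tracker address and device key are deployment-specific assumptions.
    tracker = tvm.rpc.connect_tracker("127.0.0.1", 9190)
    remote = tracker.request("android", priority=0, session_timeout=600)
    print(remote.cpu(0))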
diff --git a/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb b/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
index 7b1cdf6625..3d71183ce0 100644
--- a/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
+++ b/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb b/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb
index dda117026a..52125163f9 100644
--- a/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb
+++ b/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -18,6 +18,78 @@
         "\n\n# microTVM Host-Driven AoT\n**Authors**:\n[Mehrdad Hessar](https://github.com/mehrdadh),\n[Alan MacDonald](https://github.com/alanmacd)\n\nThis tutorial is showcasing microTVM host-driven AoT compilation with\na TFLite model. AoTExecutor reduces the overhead of parsing graph at runtime\ncompared to GraphExecutor. Also, we can have better memory management using ahead\nof time compilation. This tutorial can be executed on a x86 CPU using C runtime (CRT)\nor on Zephyr platform [...]
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install microTVM Python dependencies\n\nTVM does not include a package for Python serial communication, so\nwe must install one before using microTVM. We will also need TFLite\nto load models.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install pyserial==3.5 tflite==2.1"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "import os\n\n# By default, this tutorial runs on x86 CPU using TVM's C runtime. If you would like\n# to run on real Zephyr hardware, you must export the `TVM_MICRO_USE_HW` environment\n# variable. Otherwise (if you are using the C runtime), you can skip installing\n# Zephyr and CMSIS-NN. It takes ~20 minutes to install both of them.\nuse_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install Zephyr\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\n# Install west and ninja\npython3 -m pip install west\napt-get install -y ninja-build\n\n# Install ZephyrProject\nZEPHYR_PROJECT_PATH=\"/content/zephyrproject\"\nexport ZEPHYR_BASE=${ZEPHYR_PROJECT_PATH}/zephyr\nwest init ${ZEPHYR_PROJECT_PATH}\ncd ${ZEPHYR_BASE}\ngit checkout v2.7-branch\ncd ..\nwest update\nwest zephyr-export\nchmod -R o+w ${ZEPHYR_PROJECT_PATH}\n\n# Install Zephyr SDK\nZEPHYR_SDK_VERSION=0.13.2\nZEPHYR_SDK_FILE=\"/content/zephyr-sdk-linux-setup.run\" [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install CMSIS-NN\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\nCMSIS_SHA=\"51263182d16c92649a48144ba56c0945f9fce60e\"\nCMSIS_URL=\"http://github.com/ARM-software/CMSIS_5/archive/${CMSIS_SHA}.tar.gz\"\nexport CMSIS_PATH=/content/cmsis\nDOWNLOAD_PATH=\"/content/${CMSIS_SHA}.tar.gz\"\nmkdir ${CMSIS_PATH}\nwget ${CMSIS_URL} -O \"${DOWNLOAD_PATH}\"\ntar -xf \"${DOWNLOAD_PATH}\" -C ${CMSIS_PATH} --strip-components=1\nrm ${DOWNLOAD_PATH}"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Import Python dependencies\n\n\n"
+      ]
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -26,7 +98,7 @@
       },
       "outputs": [],
       "source": [
-        "import numpy as np\nimport pathlib\nimport json\nimport os\n\nimport tvm\nfrom tvm import relay\nfrom tvm.relay.backend import Executor, Runtime\nfrom tvm.contrib.download import download_testdata"
+        "import numpy as np\nimport pathlib\nimport json\n\nimport tvm\nfrom tvm import relay\nfrom tvm.relay.backend import Executor, Runtime\nfrom tvm.contrib.download import download_testdata"
       ]
     },
     {
@@ -44,7 +116,7 @@
       },
       "outputs": [],
       "source": [
-        "use_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))\nMODEL_URL = \"https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite\"\nMODEL_PATH = download_testdata(MODEL_URL, \"keyword_spotting_quant.tflite\", module=\"model\")\nSAMPLE_URL = \"https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy\"\nSAMPLE_PATH = download_testdata(SAMPLE_URL, \"keyword_spotting_int8_6.pyc.npy\", module=\"data [...]
+        "MODEL_URL = \"https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite\"\nMODEL_PATH = download_testdata(MODEL_URL, \"keyword_spotting_quant.tflite\", module=\"model\")\nSAMPLE_URL = \"https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy\"\nSAMPLE_PATH = download_testdata(SAMPLE_URL, \"keyword_spotting_int8_6.pyc.npy\", module=\"data\")\n\ntflite_model_buf = open(MODEL_PATH, \"rb\").read() [...]
       ]
     },
     {
@@ -98,7 +170,7 @@
       },
       "outputs": [],
       "source": [
-        "template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects(\"crt\"))\nproject_options = {}  # You can use options to provide platform-specific options through TVM.\n\nif use_physical_hw:\n    template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects(\"zephyr\"))\n    project_options = {\n        \"project_type\": \"host_driven\",\n        \"board\": BOARD,\n        \"serial_number\": SERIAL,\n        \"config_main_stack_size\": 4096,\n    } [...]
+        "template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects(\"crt\"))\nproject_options = {}  # You can use options to provide platform-specific options through TVM.\n\nif use_physical_hw:\n    template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects(\"zephyr\"))\n    project_options = {\n        \"project_type\": \"host_driven\",\n        \"board\": BOARD,\n        \"serial_number\": SERIAL,\n        \"config_main_stack_size\": 4096,\n      [...]
       ]
     },
     {
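For orientation, the AoT build step that the cells above lead up to looks roughly like the sketch below; ``relay_mod`` and ``params`` are assumed to come from the TFLite importer earlier in the tutorial.

.. code-block:: python

    import tvm
    from tvm import relay
    from tvm.relay.backend import Executor, Runtime

    # C runtime with the ahead-of-time executor; vectorization must be
    # disabled when generating C source for the CRT.
    RUNTIME = Runtime("crt", {"system-lib": True})
    EXECUTOR = Executor("aot")
    TARGET = tvm.target.target.micro("host")

    with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
        module = relay.build(
            relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
        )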
diff --git a/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb b/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
index c045c50237..c6b7bfeb7a 100644
--- a/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
+++ b/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py b/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py
index 895a601ada..ac961ca16a 100644
--- a/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py
+++ b/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py
@@ -19,7 +19,7 @@ Compile Keras Models
 =====================
 **Author**: `Yuwei Hu <https://Huyuwei.github.io/>`_
 
-This article is an introductory tutorial to deploy keras models with Relay.
+This article is an introductory tutorial to deploy Keras models with Relay.
 
 For us to begin with, keras should be installed.
 Tensorflow is also required since it's used as the default backend of keras.
@@ -28,14 +28,15 @@ A quick solution is to install via pip
 
 .. code-block:: bash
 
-    pip install -U keras --user
-    pip install -U tensorflow --user
+    %%shell
+    pip install keras tensorflow
 
 or please refer to official site
 https://keras.io/#installation
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
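After installation, importing a Keras model into Relay follows this sketch. ResNet50 with random weights serves as a stand-in model; the input name ``"input_1"`` is model-dependent, and the Relay Keras frontend expects NCHW shapes.

.. code-block:: python

    import keras
    import tvm
    from tvm import relay

    # Stand-in model; weights=None avoids downloading pretrained weights.
    model = keras.applications.resnet50.ResNet50(
        weights=None, input_shape=(224, 224, 3), classes=1000
    )

    # NCHW shape for the frontend; the input name depends on the model.
    shape_dict = {"input_1": (1, 3, 224, 224)}
    mod, params = relay.frontend.from_keras(model, shape_dict)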
diff --git a/docs/_downloads/c82f632d47458e76d2af9821b6778e36/from_keras.ipynb b/docs/_downloads/c82f632d47458e76d2af9821b6778e36/from_keras.ipynb
index 984e0f998d..8db7e39786 100644
--- a/docs/_downloads/c82f632d47458e76d2af9821b6778e36/from_keras.ipynb
+++ b/docs/_downloads/c82f632d47458e76d2af9821b6778e36/from_keras.ipynb
@@ -8,14 +8,32 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile Keras Models\n**Author**: [Yuwei Hu](https://Huyuwei.github.io/)\n\nThis article is an introductory tutorial to deploy keras models with Relay.\n\nFor us to begin with, keras should be installed.\nTensorflow is also required since it's used as the default backend of keras.\n\nA quick solution is to install via pip\n\n```bash\npip install -U keras --user\npip install -U tensorflow --user\n```\nor please refer to official site\nhttps://keras.io/#installation\n"
+        "\n# Compile Keras Models\n**Author**: [Yuwei Hu](https://Huyuwei.github.io/)\n\nThis article is an introductory tutorial to deploy Keras models with Relay.\n\nFor us to begin with, keras should be installed.\nTensorflow is also required since it's used as the default backend of keras.\n\nA quick solution is to install via pip\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install keras tensorflow"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "or please refer to official site\nhttps://keras.io/#installation\n"
       ]
     },
     {
diff --git a/docs/_downloads/c9bb7875c6ca5b2da162e177d3c9aac0/tensor_ir_blitz_course.ipynb b/docs/_downloads/c9bb7875c6ca5b2da162e177d3c9aac0/tensor_ir_blitz_course.ipynb
index 325c4943e6..f4996393ae 100644
--- a/docs/_downloads/c9bb7875c6ca5b2da162e177d3c9aac0/tensor_ir_blitz_course.ipynb
+++ b/docs/_downloads/c9bb7875c6ca5b2da162e177d3c9aac0/tensor_ir_blitz_course.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/cafefaac0e14b00fd7644da616cab35a/deploy_model_on_nano.ipynb b/docs/_downloads/cafefaac0e14b00fd7644da616cab35a/deploy_model_on_nano.ipynb
index 09bc319338..35fbb577d3 100644
--- a/docs/_downloads/cafefaac0e14b00fd7644da616cab35a/deploy_model_on_nano.ipynb
+++ b/docs/_downloads/cafefaac0e14b00fd7644da616cab35a/deploy_model_on_nano.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py b/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py
index 2e68ce9028..ae22fe20e1 100644
--- a/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py
+++ b/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py
@@ -32,6 +32,13 @@ A default parser is provided. Users can implement their own renderers to render
 Here we use a renderer rendering graph in the text-form.
 It is a lightweight, AST-like visualizer, inspired by `clang ast-dump <https://clang.llvm.org/docs/IntroductionToTheClangAST.html>`_.
 We will introduce how to implement customized parsers and renderers through interface classes.
+To install dependencies, run:
+
+.. code-block:: bash
+
+    %%shell
+    pip install graphviz
+
 
 For more details, please refer to :py:mod:`tvm.contrib.relay_viz`.
 """
diff --git a/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py b/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py
index 8910817c21..e59f0107f9 100644
--- a/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py
+++ b/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py
@@ -27,6 +27,7 @@ Notice that you need to build TVM with cuda and llvm enabled.
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb b/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
index 198fc9d236..9d3728f3ab 100644
--- a/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
+++ b/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI, with CUDA enabled. To use this,\n# you must request a Google Colab instance with a GPU by going to Runtime ->\n# Change runtime type -> Hardware accelerator -> GPU. If you wish to build from\n# source, see see https://tvm.apache.org/docs/install/from_source.html\npip install tlcpack-nightly-cu113 --pre -f https://tlcpack.ai/wheels"
       ]
     },
     {
diff --git a/docs/_downloads/d58ec306b89044968adefb49e6552378/low_level_custom_pass.ipynb b/docs/_downloads/d58ec306b89044968adefb49e6552378/low_level_custom_pass.ipynb
index b1b35602d6..19839c25d1 100644
--- a/docs/_downloads/d58ec306b89044968adefb49e6552378/low_level_custom_pass.ipynb
+++ b/docs/_downloads/d58ec306b89044968adefb49e6552378/low_level_custom_pass.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/d92aacfae35477bed0f7f60aa8d2714e/deploy_ssd_gluoncv.ipynb b/docs/_downloads/d92aacfae35477bed0f7f60aa8d2714e/deploy_ssd_gluoncv.ipynb
index b90d637de3..11830c70c0 100644
--- a/docs/_downloads/d92aacfae35477bed0f7f60aa8d2714e/deploy_ssd_gluoncv.ipynb
+++ b/docs/_downloads/d92aacfae35477bed0f7f60aa8d2714e/deploy_ssd_gluoncv.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py b/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py
index 8953ffc2e4..e6106dd95b 100644
--- a/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py
+++ b/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py
@@ -25,7 +25,13 @@ In this tutorial, we will run our GCN on Cora dataset to demonstrate.
 Cora dataset is a common benchmark for Graph Neural Networks (GNN) and frameworks that support GNN training and inference.
 We directly load the dataset from DGL library to do the apples to apples comparison against DGL.
 
-Please refer to DGL doc for DGL installation at
+.. code-block:: bash
+
+    %%shell
+    pip install torch==1.9.0
+    pip install dgl==v0.7.2 -f https://data.dgl.ai/wheels/repo.html
+
+Please refer to the DGL docs for installation at
 https://docs.dgl.ai/install/index.html.
 
 Please refer to PyTorch guide for PyTorch installation at
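With those wheels installed, loading the Cora dataset through DGL looks roughly like this sketch (``CoraGraphDataset`` downloads the data on first use):

.. code-block:: python

    from dgl.data import CoraGraphDataset

    # Cora: a single citation graph with node features and labels.
    dataset = CoraGraphDataset()
    g = dataset[0]
    print(g.num_nodes(), g.num_edges())
    print(g.ndata["feat"].shape, g.ndata["label"].shape)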
diff --git a/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py b/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py
index 5d173e3812..7964694e68 100644
--- a/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py
+++ b/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py
@@ -38,6 +38,7 @@ __name__ == "__main__":` block.
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py b/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py
index f0256bc7d3..980091d391 100644
--- a/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py
+++ b/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py
@@ -21,15 +21,14 @@ Compile ONNX Models
 
 This article is an introductory tutorial to deploy ONNX models with Relay.
 
-For us to begin with, ONNX package must be installed.
-
-A quick solution is to install protobuf compiler, and
+To begin, install the ONNX package:
 
 .. code-block:: bash
 
-    pip install --user onnx onnxoptimizer
+    %%shell
+    pip install onnx onnxoptimizer
 
-or please refer to official site.
+Alternatively, you can refer to the official site:
 https://github.com/onnx/onnx
 """
 
diff --git a/docs/_downloads/edc9d28c4fbc249e2e7b78002af63b84/using_external_lib.ipynb b/docs/_downloads/edc9d28c4fbc249e2e7b78002af63b84/using_external_lib.ipynb
index ee6d42c75a..64e90366af 100644
--- a/docs/_downloads/edc9d28c4fbc249e2e7b78002af63b84/using_external_lib.ipynb
+++ b/docs/_downloads/edc9d28c4fbc249e2e7b78002af63b84/using_external_lib.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py b/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py
index 479269a224..bbd207dbac 100644
--- a/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py
+++ b/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py
@@ -47,7 +47,7 @@ Since we do not use any 3rdparty library, there is no setup needed.
 
 If you would like to try this with your own datatype library, first bring the library's functions into the process space with ``CDLL``:
 
-.. code-block :: python
+.. code-block:: python
 
     ctypes.CDLL('my-datatype-lib.so', ctypes.RTLD_GLOBAL)
 """
diff --git a/docs/_downloads/eed2658f15243bab719b2de7769fa45a/deploy_model_on_android.ipynb b/docs/_downloads/eed2658f15243bab719b2de7769fa45a/deploy_model_on_android.ipynb
index b0aea16665..e027344900 100644
--- a/docs/_downloads/eed2658f15243bab719b2de7769fa45a/deploy_model_on_android.ipynb
+++ b/docs/_downloads/eed2658f15243bab719b2de7769fa45a/deploy_model_on_android.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb b/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb
index 3436200865..9d83da7925 100644
--- a/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb
+++ b/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -47,14 +47,14 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Supported model formats\n\n  TVMC supports models created with Keras, ONNX, TensorFlow, TFLite\n  and Torch. Use the option ``--model-format`` if you need to\n  explicitly provide the model format you are using. See ``tvmc\n  compile --help`` for more information.\n\n\n"
+        "<div class=\"alert alert-info\"><h4>Supported model formats</h4><p>TVMC supports models created with Keras, ONNX, TensorFlow, TFLite\nand Torch. Use the option ``--model-format`` if you need to\nexplicitly provide the model format you are using. See ``tvmc\ncompile --help`` for more information.</p></div>"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Adding ONNX Support to TVM\n\n   TVM relies on the ONNX python library being available on your system. You can\n   install ONNX using the command ``pip3 install --user onnx onnxoptimizer``. You\n   may remove the ``--user`` option if you have root access and want to install\n   ONNX globally.  The ``onnxoptimizer`` dependency is optional, and is only used\n   for ``onnx>=1.9``.\n\n\n"
+        "<div class=\"alert alert-info\"><h4>Adding ONNX Support to TVM</h4><p>TVM relies on the ONNX python library being available on your system. You can\ninstall ONNX using the command ``pip3 install --user onnx onnxoptimizer``. You\nmay remove the ``--user`` option if you have root access and want to install\nONNX globally.  The ``onnxoptimizer`` dependency is optional, and is only used\nfor ``onnx>=1.9``.</p></div>"
       ]
     },
     {
@@ -68,7 +68,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        ".. admonition:: Defining the Correct Target\n\n  Specifying the correct target (option ``--target``) can have a huge\n  impact on the performance of the compiled module, as it can take\n  advantage of hardware features available on the target. For more\n  information, please refer to `Auto-tuning a convolutional network for\n  x86 CPU <tune_relay_x86>`. We recommend identifying which CPU you are\n  running, along with optional features, and set the target appropriately.\n\n"
+        "<div class=\"alert alert-info\"><h4>Defining the Correct Target</h4><p>Specifying the correct target (option ``--target``) can have a huge\nimpact on the performance of the compiled module, as it can take\nadvantage of hardware features available on the target. For more\ninformation, please refer to `Auto-tuning a convolutional network for\nx86 CPU <tune_relay_x86>`. We recommend identifying which CPU you are\nrunning, along with optional features, and set the target appropriate [...]
       ]
     },
     {
@@ -82,7 +82,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "### Input pre-processing\n\nFor our ResNet-50 v2 model, the input is expected to be in ImageNet format.\nHere is an example of a script to pre-process an image for ResNet-50 v2.\n\nYou will need to have a supported version of the Python Image Library\ninstalled. You can use ``pip3 install --user pillow`` to satisfy this\nrequirement for the script.\n\n.. code-block:: python\n    :caption: preprocess.py\n    :name: preprocess.py\n\n    #!python ./preprocess.py\n    from tvm.contr [...]
+        "### Input pre-processing\n\nFor our ResNet-50 v2 model, the input is expected to be in ImageNet format.\nHere is an example of a script to pre-process an image for ResNet-50 v2.\n\nYou will need to have a supported version of the Python Image Library\ninstalled. You can use ``pip3 install --user pillow`` to satisfy this\nrequirement for the script.\n\n"
       ]
     },
     {
@@ -96,14 +96,14 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "### Output Post-Processing\n\nAs previously mentioned, each model will have its own particular way of\nproviding output tensors.\n\nIn our case, we need to run some post-processing to render the outputs from\nResNet-50 v2 into a more human-readable form, using the lookup-table provided\nfor the model.\n\nThe script below shows an example of the post-processing to extract labels\nfrom the output of our compiled module.\n\n.. code-block:: python\n    :caption: postprocess.py\n     [...]
+        "### Output Post-Processing\n\nAs previously mentioned, each model will have its own particular way of\nproviding output tensors.\n\nIn our case, we need to run some post-processing to render the outputs from\nResNet-50 v2 into a more human-readable form, using the lookup-table provided\nfor the model.\n\nThe script below shows an example of the post-processing to extract labels\nfrom the output of our compiled module.\n\nRunning this script should produce the following output:\n [...]
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Automatically Tuning the ResNet Model\n\nThe previous model was compiled to work on the TVM runtime, but did not\ninclude any platform specific optimization. In this section, we will show you\nhow to build an optimized model using TVMC to target your working platform.\n\nIn some cases, we might not get the expected performance when running\ninferences using our compiled module.  In cases like this, we can make use of\nthe auto-tuner, to find a better configuration for our mod [...]
+        "## Automatically Tuning the ResNet Model\n\nThe previous model was compiled to work on the TVM runtime, but did not\ninclude any platform specific optimization. In this section, we will show you\nhow to build an optimized model using TVMC to target your working platform.\n\nIn some cases, we might not get the expected performance when running\ninferences using our compiled module.  In cases like this, we can make use of\nthe auto-tuner, to find a better configuration for our mod [...]
       ]
     },
     {
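The pre-processing script mentioned above amounts to the following sketch: download a test image, normalize it to ImageNet statistics in NCHW layout, and save it as the ``.npz`` archive that ``tvmc run`` consumes.

.. code-block:: python

    #!python ./preprocess.py
    from tvm.contrib.download import download_testdata
    from PIL import Image
    import numpy as np

    img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
    img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

    # Resize to 224x224 and convert to CHW layout.
    img_data = np.asarray(Image.open(img_path).resize((224, 224))).astype("float32")
    img_data = np.transpose(img_data, (2, 0, 1))

    # Normalize per-channel with ImageNet mean and stddev.
    imagenet_mean = np.array([0.485, 0.456, 0.406])
    imagenet_stddev = np.array([0.229, 0.224, 0.225])
    norm_img_data = (img_data / 255 - imagenet_mean[:, None, None]) / imagenet_stddev[:, None, None]

    # Add the batch dimension and save as .npz (outputs imagenet_cat.npz).
    np.savez("imagenet_cat", data=np.expand_dims(norm_img_data, 0).astype("float32"))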
diff --git a/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb b/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb
index 8096930fe0..892781ed49 100644
--- a/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb
+++ b/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/f407f66fb8174d0d4ec37407af1128d6/using_pipeline_executor.ipynb b/docs/_downloads/f407f66fb8174d0d4ec37407af1128d6/using_pipeline_executor.ipynb
index b0314b69bb..08bbf32eec 100644
--- a/docs/_downloads/f407f66fb8174d0d4ec37407af1128d6/using_pipeline_executor.ipynb
+++ b/docs/_downloads/f407f66fb8174d0d4ec37407af1128d6/using_pipeline_executor.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/f6ff0fbc61d45d2cc0f53ebbf11a5fb5/use_pass_instrument.ipynb b/docs/_downloads/f6ff0fbc61d45d2cc0f53ebbf11a5fb5/use_pass_instrument.ipynb
index 942b2238c8..7ef2f98624 100644
--- a/docs/_downloads/f6ff0fbc61d45d2cc0f53ebbf11a5fb5/use_pass_instrument.ipynb
+++ b/docs/_downloads/f6ff0fbc61d45d2cc0f53ebbf11a5fb5/use_pass_instrument.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
diff --git a/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py b/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py
index eb27c4b3e3..0925c9fe81 100644
--- a/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py
+++ b/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py
@@ -27,8 +27,9 @@ A quick solution is to install via pip
 
 .. code-block:: bash
 
+    %%shell
     pip install flowvision==0.1.0
-    python3 -m pip install -f https://release.oneflow.info oneflow==0.7.0+cpu
+    pip install -f https://release.oneflow.info oneflow==0.7.0+cpu
 
 or please refer to official site:
 https://github.com/Oneflow-Inc/oneflow
@@ -37,6 +38,7 @@ Currently, TVM supports OneFlow 0.7.0. Other versions may be unstable.
 """
 
 # sphinx_gallery_start_ignore
+# sphinx_gallery_requires_cuda = True
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
diff --git a/docs/_downloads/f83ba3df2d52f9b54cf141114359481a/micro_autotune.ipynb b/docs/_downloads/f83ba3df2d52f9b54cf141114359481a/micro_autotune.ipynb
index 212179f46a..272e54ae35 100644
--- a/docs/_downloads/f83ba3df2d52f9b54cf141114359481a/micro_autotune.ipynb
+++ b/docs/_downloads/f83ba3df2d52f9b54cf141114359481a/micro_autotune.ipynb
@@ -8,7 +8,7 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
@@ -18,6 +18,78 @@
         "\n\n# Autotuning with microTVM\n**Authors**:\n[Andrew Reusch](https://github.com/areusch),\n[Mehrdad Hessar](https://github.com/mehrdadh)\n\nThis tutorial explains how to autotune a model using the C runtime.\n"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install microTVM Python dependencies\n\nTVM does not include a package for Python serial communication, so\nwe must install one before using microTVM. We will also need TFLite\nto load models.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install pyserial==3.5 tflite==2.1"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "# You can skip the following two sections (installing Zephyr and CMSIS-NN) if the following flag is False.\n# Installing Zephyr takes ~20 min.\nimport os\n\nuse_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install Zephyr\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\n# Install west and ninja\npython3 -m pip install west\napt-get install -y ninja-build\n\n# Install ZephyrProject\nZEPHYR_PROJECT_PATH=\"/content/zephyrproject\"\nexport ZEPHYR_BASE=${ZEPHYR_PROJECT_PATH}/zephyr\nwest init ${ZEPHYR_PROJECT_PATH}\ncd ${ZEPHYR_BASE}\ngit checkout v2.7-branch\ncd ..\nwest update\nwest zephyr-export\nchmod -R o+w ${ZEPHYR_PROJECT_PATH}\n\n# Install Zephyr SDK\nZEPHYR_SDK_VERSION=0.13.2\nZEPHYR_SDK_FILE=\"/content/zephyr-sdk-linux-setup.run\" [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Install CMSIS-NN\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\nCMSIS_SHA=\"51263182d16c92649a48144ba56c0945f9fce60e\"\nCMSIS_URL=\"http://github.com/ARM-software/CMSIS_5/archive/${CMSIS_SHA}.tar.gz\"\nexport CMSIS_PATH=/content/cmsis\nDOWNLOAD_PATH=\"/content/${CMSIS_SHA}.tar.gz\"\nmkdir ${CMSIS_PATH}\nwget ${CMSIS_URL} -O \"${DOWNLOAD_PATH}\"\ntar -xf \"${DOWNLOAD_PATH}\" -C ${CMSIS_PATH} --strip-components=1\nrm ${DOWNLOAD_PATH}"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Import Python dependencies\n\n\n"
+      ]
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -26,14 +98,14 @@
       },
       "outputs": [],
       "source": [
-        "import os\nimport json\nimport numpy as np\nimport pathlib\n\nimport tvm\nfrom tvm.relay.backend import Runtime\n\nuse_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))"
+        "import json\nimport numpy as np\nimport pathlib\n\nimport tvm\nfrom tvm.relay.backend import Runtime"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Defining the model\n\n To begin with, define a model in Relay to be executed on-device. Then create an IRModule from relay model and\n fill parameters with random numbers.\n\n\n"
+        "### Defining the model\n\n To begin with, define a model in Relay to be executed on-device. Then create an IRModule from relay model and\n fill parameters with random numbers.\n\n\n"
       ]
     },
     {
@@ -51,7 +123,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Defining the target\n Now we define the TVM target that describes the execution environment. This looks very similar\n to target definitions from other microTVM tutorials. Alongside this we pick the C Runtime to code\n generate our model against.\n\n When running on physical hardware, choose a target and a board that\n describe the hardware. There are multiple hardware targets that could be selected from\n PLATFORM list in this tutorial. You can chose the platform by passing  [...]
+        "### Defining the target\n Now we define the TVM target that describes the execution environment. This looks very similar\n to target definitions from other microTVM tutorials. Alongside this we pick the C Runtime to code\n generate our model against.\n\n When running on physical hardware, choose a target and a board that\n describe the hardware. There are multiple hardware targets that could be selected from\n PLATFORM list in this tutorial. You can chose the platform by passing [...]
       ]
     },
     {
@@ -69,7 +141,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Extracting tuning tasks\n Not all operators in the Relay program printed above can be tuned. Some are so trivial that only\n a single implementation is defined; others don't make sense as tuning tasks. Using\n `extract_from_program`, you can produce a list of tunable tasks.\n\n Because task extraction involves running the compiler, we first configure the compiler's\n transformation passes; we'll apply the same configuration later on during autotuning.\n\n\n"
+        "### Extracting tuning tasks\n Not all operators in the Relay program printed above can be tuned. Some are so trivial that only\n a single implementation is defined; others don't make sense as tuning tasks. Using\n `extract_from_program`, you can produce a list of tunable tasks.\n\n Because task extraction involves running the compiler, we first configure the compiler's\n transformation passes; we'll apply the same configuration later on during autotuning.\n\n\n"
       ]
     },
     {
@@ -87,7 +159,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Configuring microTVM\n Before autotuning, we need to define a module loader and then pass that to\n a `tvm.autotvm.LocalBuilder`. Then we create a `tvm.autotvm.LocalRunner` and use\n both builder and runner to generates multiple measurements for auto tunner.\n\n In this tutorial, we have the option to use x86 host as an example or use different targets\n from Zephyr RTOS. If you choose pass `--platform=host` to this tutorial it will uses x86. You can\n choose other options by [...]
+        "### Configuring microTVM\n Before autotuning, we need to define a module loader and then pass that to\n a `tvm.autotvm.LocalBuilder`. Then we create a `tvm.autotvm.LocalRunner` and use\n both builder and runner to generates multiple measurements for auto tunner.\n\n In this tutorial, we have the option to use x86 host as an example or use different targets\n from Zephyr RTOS. If you choose pass `--platform=host` to this tutorial it will uses x86. You can\n choose other options b [...]
       ]
     },
     {
@@ -105,7 +177,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Run Autotuning\n Now we can run autotuning separately on each extracted task on microTVM device.\n\n\n"
+        "### Run Autotuning\n Now we can run autotuning separately on each extracted task on microTVM device.\n\n\n"
       ]
     },
     {
@@ -123,7 +195,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Timing the untuned program\n For comparison, let's compile and run the graph without imposing any autotuning schedules. TVM\n will select a randomly-tuned implementation for each operator, which should not perform as well as\n the tuned operator.\n\n\n"
+        "### Timing the untuned program\n For comparison, let's compile and run the graph without imposing any autotuning schedules. TVM\n will select a randomly-tuned implementation for each operator, which should not perform as well as\n the tuned operator.\n\n\n"
       ]
     },
     {
@@ -141,7 +213,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Timing the tuned program\n Once autotuning completes, you can time execution of the entire program using the Debug Runtime:\n\n"
+        "### Timing the tuned program\n Once autotuning completes, you can time execution of the entire program using the Debug Runtime:\n\n"
       ]
     },
     {
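The "Defining the model" step earlier in this notebook boils down to a sketch like the following: a single conv2d in Relay with randomly initialized parameters.

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    # A tiny stand-in model: one conv2d over a 1x3x10x10 input.
    data = relay.var("data", relay.TensorType((1, 3, 10, 10), "float32"))
    weight = relay.var("weight", relay.TensorType((6, 3, 5, 5), "float32"))
    func = relay.Function([data, weight], relay.nn.conv2d(data, weight))
    mod = tvm.IRModule.from_expr(func)

    # Fill the weight with random numbers, as the tutorial describes.
    params = {"weight": np.random.rand(6, 3, 5, 5).astype("float32")}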
diff --git a/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py b/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py
index 4d6890f8d9..8646b6d7ec 100644
--- a/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py
+++ b/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py
@@ -30,16 +30,42 @@ of time compilation. This tutorial can be executed on a x86 CPU using C runtime
 or on Zephyr platform on a microcontroller/board supported by Zephyr.
 """
 
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
+#
+
 # sphinx_gallery_start_ignore
 from tvm import testing
 
 testing.utils.install_request_hook(depth=3)
 # sphinx_gallery_end_ignore
 
+import os
+
+# By default, this tutorial runs on x86 CPU using TVM's C runtime. If you would like
+# to run on real Zephyr hardware, you must export the `TVM_MICRO_USE_HW` environment
+# variable. Otherwise (if you are using the C runtime), you can skip installing
+# Zephyr and CMSIS-NN. It takes ~20 minutes to install both of them.
+use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
+
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_zephyr.rst
+#
+
+######################################################################
+#
+#     .. include:: ../../../../gallery/how_to/work_with_microtvm/install_cmsis.rst
+#
+
+######################################################################
+# Import Python dependencies
+# -------------------------------
+#
 import numpy as np
 import pathlib
 import json
-import os
 
 import tvm
 from tvm import relay
@@ -57,7 +83,6 @@ from tvm.contrib.download import download_testdata
 # **Note:** By default this tutorial runs on x86 CPU using CRT, if you would like to run on Zephyr platform
 # you need to export `TVM_MICRO_USE_HW` environment variable.
 #
-use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
 MODEL_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite"
 MODEL_PATH = download_testdata(MODEL_URL, "keyword_spotting_quant.tflite", module="model")
 SAMPLE_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy"
@@ -139,6 +164,8 @@ if use_physical_hw:
         "board": BOARD,
         "serial_number": SERIAL,
         "config_main_stack_size": 4096,
+        "cmsis_path": os.getenv("CMSIS_PATH", default="/content/cmsis"),
+        "zephyr_base": os.getenv("ZEPHYR_BASE", default="/content/zephyrproject/zephyr"),
     }
 
 temp_dir = tvm.contrib.utils.tempdir()
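
The `cmsis_path` and `zephyr_base` options added above are consumed when the Zephyr project is generated from the built module; a hedged sketch of that step, with `lowered` (the AOT-compiled module) assumed from the rest of this tutorial:

.. code-block:: python

    import pathlib
    import tvm.micro

    # Generate, build, and flash a Zephyr project using the options dict
    # (including cmsis_path and zephyr_base) defined above.
    template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr"))
    project = tvm.micro.generate_project(
        str(template_project_path), lowered, temp_dir / "project", project_options
    )
    project.build()
    project.flash()
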
diff --git a/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py b/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py
index 98b531fa6d..064ed70e46 100644
--- a/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py
+++ b/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py
@@ -21,15 +21,15 @@ Compile PyTorch Models
 
 This article is an introductory tutorial to deploy PyTorch models with Relay.
 
-For us to begin with, PyTorch should be installed.
-TorchVision is also required since we will be using it as our model zoo.
-
-A quick solution is to install via pip
+To begin, PyTorch should be installed.
+TorchVision is also required so we can use the model zoo.
+A quick solution is to install via pip:
 
 .. code-block:: bash
 
-    pip install torch==1.7.0
-    pip install torchvision==0.8.1
+    %%shell
+    pip install torch
+    pip install torchvision
 
 or please refer to official site
 https://pytorch.org/get-started/locally/
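
Once torch and torchvision are installed, the import path this tutorial follows is: trace the model to TorchScript, then hand it to the PyTorch frontend. A minimal sketch with a ResNet-18 as used later in the tutorial (the input name `input0` is an arbitrary choice):

.. code-block:: python

    import torch
    import torchvision
    from tvm import relay

    # Trace a pretrained model to TorchScript, which Relay can ingest.
    model = torchvision.models.resnet18(pretrained=True).eval()
    input_data = torch.randn(1, 3, 224, 224)
    scripted_model = torch.jit.trace(model, input_data).eval()

    # Each entry maps a graph input name to its shape.
    shape_list = [("input0", input_data.shape)]
    mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
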
diff --git a/docs/_downloads/f97d815b408ef3f4d6bcb3e073c2d4dd/from_darknet.ipynb b/docs/_downloads/f97d815b408ef3f4d6bcb3e073c2d4dd/from_darknet.ipynb
index d7dbb1476f..3f45a07515 100644
--- a/docs/_downloads/f97d815b408ef3f4d6bcb3e073c2d4dd/from_darknet.ipynb
+++ b/docs/_downloads/f97d815b408ef3f4d6bcb3e073c2d4dd/from_darknet.ipynb
@@ -8,14 +8,25 @@
       },
       "outputs": [],
       "source": [
-        "%matplotlib inline"
+        "%%shell\n# Installs the latest dev build of TVM from PyPI. If you wish to build\n# from source, see https://tvm.apache.org/docs/install/from_source.html\npip install apache-tvm --pre"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n# Compile YOLO-V2 and YOLO-V3 in DarkNet Models\n**Author**: [Siju Samuel](https://siju-samuel.github.io/)\n\nThis article is an introductory tutorial to deploy darknet models with TVM.\nAll the required models and libraries will be downloaded from the internet by the script.\nThis script runs the YOLO-V2 and YOLO-V3 Model with the bounding boxes\nDarknet parsing have dependancy with CFFI and CV2 library\nPlease install CFFI and CV2 before executing this script\n\n```bash\npip [...]
+        "\n# Compile YOLO-V2 and YOLO-V3 in DarkNet Models\n**Author**: [Siju Samuel](https://siju-samuel.github.io/)\n\nThis article is an introductory tutorial to deploy darknet models with TVM.\nAll the required models and libraries will be downloaded from the internet by the script.\nThis script runs the YOLO-V2 and YOLO-V3 Model with the bounding boxes\nDarknet parsing have dependancy with CFFI and CV2 library\nPlease install CFFI and CV2 before executing this script\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%%shell\npip install cffi opencv-python"
       ]
     },
     {
diff --git a/docs/_images/sphx_glr_micro_train_001.png b/docs/_images/sphx_glr_micro_train_001.png
index fd04aec899..4730ebaecb 100644
Binary files a/docs/_images/sphx_glr_micro_train_001.png and b/docs/_images/sphx_glr_micro_train_001.png differ
diff --git a/docs/_images/sphx_glr_micro_train_thumb.png b/docs/_images/sphx_glr_micro_train_thumb.png
index 176d23232e..4f63c99e35 100644
Binary files a/docs/_images/sphx_glr_micro_train_thumb.png and b/docs/_images/sphx_glr_micro_train_thumb.png differ
diff --git a/docs/_sources/how_to/compile_models/from_coreml.rst.txt b/docs/_sources/how_to/compile_models/from_coreml.rst.txt
index d59589dbf8..dbbc2020df 100644
--- a/docs/_sources/how_to/compile_models/from_coreml.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_coreml.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_coreml.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_coreml.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_coreml.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/a883b8474634054b6a79c17a288aa8ed/from_coreml.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -24,18 +28,17 @@ Compile CoreML Models
 
 This article is an introductory tutorial to deploy CoreML models with Relay.
 
-For us to begin with, coremltools module is required to be installed.
-
-A quick solution is to install via pip
+To begin, we must install coremltools:
 
 .. code-block:: bash
 
-    pip install -U coremltools --user
+    %%shell
+    pip install coremltools
 
 or please refer to official site
 https://github.com/apple/coremltools
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-46
+.. GENERATED FROM PYTHON SOURCE LINES 36-45
 
 .. code-block:: default
 
@@ -55,14 +58,14 @@ https://github.com/apple/coremltools
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 52-56
+.. GENERATED FROM PYTHON SOURCE LINES 51-55
 
 Load pretrained CoreML model
 ----------------------------
 We will download and load a pretrained mobilenet classification network
 provided by Apple in this example.
 
-.. GENERATED FROM PYTHON SOURCE LINES 56-62
+.. GENERATED FROM PYTHON SOURCE LINES 55-61
 
 .. code-block:: default
 
@@ -79,13 +82,13 @@ provided by apple in this example
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 63-66
+.. GENERATED FROM PYTHON SOURCE LINES 62-65
 
 Load a test image
 ------------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 66-73
+.. GENERATED FROM PYTHON SOURCE LINES 65-72
 
 .. code-block:: default
 
@@ -103,13 +106,13 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 74-77
+.. GENERATED FROM PYTHON SOURCE LINES 73-76
 
 Compile the model on Relay
 ---------------------------
 We should be familiar with the process by now.
 
-.. GENERATED FROM PYTHON SOURCE LINES 77-86
+.. GENERATED FROM PYTHON SOURCE LINES 76-85
 
 .. code-block:: default
 
@@ -129,13 +132,13 @@ We should be familiar with the process right now.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 87-90
+.. GENERATED FROM PYTHON SOURCE LINES 86-89
 
 Execute on TVM
 -------------------
 The process is no different from other examples.
 
-.. GENERATED FROM PYTHON SOURCE LINES 90-103
+.. GENERATED FROM PYTHON SOURCE LINES 89-102
 
 .. code-block:: default
 
@@ -159,13 +162,13 @@ The process is no different from other example
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 104-107
+.. GENERATED FROM PYTHON SOURCE LINES 103-106
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-121
+.. GENERATED FROM PYTHON SOURCE LINES 106-120
 
 .. code-block:: default
 
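The "Compile the model on Relay" step above reduces to a single frontend call; a minimal sketch, with the file name and the CoreML input name `image` assumed from this tutorial:

.. code-block:: python

    import coremltools
    from tvm import relay

    # Load the mobilenet CoreML model downloaded by the tutorial.
    mlmodel = coremltools.models.MLModel("mobilenet.mlmodel")

    # Map the CoreML input name to its NCHW shape.
    shape_dict = {"image": (1, 3, 224, 224)}
    mod, params = relay.frontend.from_coreml(mlmodel, shape_dict)
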
diff --git a/docs/_sources/how_to/compile_models/from_darknet.rst.txt b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
index 50ebc0a470..0bc8c99c7f 100644
--- a/docs/_sources/how_to/compile_models/from_darknet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_darknet.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_darknet.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_darknet.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/f97d815b408ef3f4d6bcb3e073c2d4dd/from_darknet.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -30,10 +34,10 @@ Please install CFFI and CV2 before executing this script
 
 .. code-block:: bash
 
-  pip install cffi
-  pip install opencv-python
+  %%shell
+  pip install cffi opencv-python
 
-.. GENERATED FROM PYTHON SOURCE LINES 33-50
+.. GENERATED FROM PYTHON SOURCE LINES 34-51
 
 .. code-block:: default
 
@@ -61,13 +65,13 @@ Please install CFFI and CV2 before executing this script
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 56-59
+.. GENERATED FROM PYTHON SOURCE LINES 57-60
 
 Choose the model
 -----------------------
 Models are: 'yolov2', 'yolov3' or 'yolov3-tiny'
 
-.. GENERATED FROM PYTHON SOURCE LINES 59-63
+.. GENERATED FROM PYTHON SOURCE LINES 60-64
 
 .. code-block:: default
 
@@ -82,13 +86,13 @@ Models are: 'yolov2', 'yolov3' or 'yolov3-tiny'
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 64-67
+.. GENERATED FROM PYTHON SOURCE LINES 65-68
 
 Download required files
 -----------------------
 Download cfg and weights file if first time.
 
-.. GENERATED FROM PYTHON SOURCE LINES 67-99
+.. GENERATED FROM PYTHON SOURCE LINES 68-100
 
 .. code-block:: default
 
@@ -137,13 +141,13 @@ Download cfg and weights file if first time.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 100-103
+.. GENERATED FROM PYTHON SOURCE LINES 101-104
 
 Import the graph to Relay
 -------------------------
 compile the model
 
-.. GENERATED FROM PYTHON SOURCE LINES 103-112
+.. GENERATED FROM PYTHON SOURCE LINES 104-113
 
 .. code-block:: default
 
@@ -169,12 +173,12 @@ compile the model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-115
+.. GENERATED FROM PYTHON SOURCE LINES 114-116
 
 Load a test image
 -----------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 115-121
+.. GENERATED FROM PYTHON SOURCE LINES 116-122
 
 .. code-block:: default
 
@@ -197,13 +201,13 @@ Load a test image
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-125
+.. GENERATED FROM PYTHON SOURCE LINES 123-126
 
 Execute on TVM Runtime
 ----------------------
 The process is no different from other examples.
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-209
+.. GENERATED FROM PYTHON SOURCE LINES 126-210
 
 .. code-block:: default
 
@@ -315,7 +319,7 @@ The process is no different from other examples.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  8.808 seconds)
+   **Total running time of the script:** ( 1 minutes  6.880 seconds)
 
 
 .. _sphx_glr_download_how_to_compile_models_from_darknet.py:
diff --git a/docs/_sources/how_to/compile_models/from_keras.rst.txt b/docs/_sources/how_to/compile_models/from_keras.rst.txt
index 00780d31c4..88a5498f1d 100644
--- a/docs/_sources/how_to/compile_models/from_keras.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_keras.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_keras.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_keras.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_keras.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/c82f632d47458e76d2af9821b6778e36/from_keras.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -22,7 +26,7 @@ Compile Keras Models
 =====================
 **Author**: `Yuwei Hu <https://Huyuwei.github.io/>`_
 
-This article is an introductory tutorial to deploy keras models with Relay.
+This article is an introductory tutorial to deploy Keras models with Relay.
 
 To begin, Keras should be installed.
 TensorFlow is also required, since it is used as the default backend of Keras.
@@ -31,8 +35,8 @@ A quick solution is to install via pip
 
 .. code-block:: bash
 
-    pip install -U keras --user
-    pip install -U tensorflow --user
+    %%shell
+    pip install keras tensorflow
 
 or please refer to official site
 https://keras.io/#installation
@@ -57,13 +61,13 @@ https://keras.io/#installation
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 52-55
+.. GENERATED FROM PYTHON SOURCE LINES 53-56
 
 Load pretrained keras model
 ----------------------------
 We load a pretrained resnet-50 classification model provided by keras.
 
-.. GENERATED FROM PYTHON SOURCE LINES 55-80
+.. GENERATED FROM PYTHON SOURCE LINES 56-81
 
 .. code-block:: default
 
@@ -99,13 +103,13 @@ We load a pretrained resnet-50 classification model provided by keras.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 81-84
+.. GENERATED FROM PYTHON SOURCE LINES 82-85
 
 Load a test image
 ------------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-98
+.. GENERATED FROM PYTHON SOURCE LINES 85-99
 
 .. code-block:: default
 
@@ -141,13 +145,13 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-102
+.. GENERATED FROM PYTHON SOURCE LINES 100-103
 
 Compile the model with Relay
 ----------------------------
 Convert the Keras model (NHWC layout) to the Relay format (NCHW layout).
 
-.. GENERATED FROM PYTHON SOURCE LINES 102-116
+.. GENERATED FROM PYTHON SOURCE LINES 103-117
 
 .. code-block:: default
 
@@ -172,12 +176,12 @@ convert the keras model(NHWC layout) to Relay format(NCHW layout).
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-119
+.. GENERATED FROM PYTHON SOURCE LINES 118-120
 
 Execute on TVM
 ---------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 119-123
+.. GENERATED FROM PYTHON SOURCE LINES 120-124
 
 .. code-block:: default
 
@@ -192,13 +196,13 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-127
+.. GENERATED FROM PYTHON SOURCE LINES 125-128
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 127-144
+.. GENERATED FROM PYTHON SOURCE LINES 128-145
 
 .. code-block:: default
 
@@ -228,7 +232,7 @@ Look up prediction top 1 index in 1000 class synset.
  .. code-block:: none
 
     Relay top-1 id: 285, class name: Egyptian cat
-
    1/1 [==============================] - ETA: 0s
    1/1 [==============================] - 1s 945ms/step
+
    1/1 [==============================] - ETA: 0s
    1/1 [==============================] - 1s 933ms/step
     Keras top-1 id: 285, class name: Egyptian cat
 
 
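The layout note above (converting the NHWC Keras model into Relay's NCHW format) is handled by the frontend itself; a minimal sketch, assuming the stock ResNet-50 used in this tutorial:

.. code-block:: python

    from tensorflow import keras
    from tvm import relay

    model = keras.applications.ResNet50(weights="imagenet")

    # Keras graphs are NHWC; requesting layout="NCHW" makes the frontend
    # insert the transposes, so the declared input shape is NCHW.
    shape_dict = {model.input_names[0]: (1, 3, 224, 224)}
    mod, params = relay.frontend.from_keras(model, shape_dict, layout="NCHW")
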
diff --git a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
index ba6b325f7f..267a11b602 100644
--- a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_mxnet.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_mxnet.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_mxnet.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/4bbcfcce3c35b0b795a42c998ceb3770/from_mxnet.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -24,20 +28,17 @@ Compile MXNet Models
 ====================
 **Author**: `Joshua Z. Zhang <https://zhreshold.github.io/>`_,             `Kazutaka Morita <https://github.com/kazum>`_
 
-This article is an introductory tutorial to deploy mxnet models with Relay.
-
-For us to begin with, mxnet module is required to be installed.
-
-A quick solution is
+This article is an introductory tutorial to deploy MXNet models with Relay. To begin, we must install `mxnet`:
 
 .. code-block:: bash
 
-    pip install mxnet --user
+    %%shell
+    pip install mxnet
 
 or please refer to official installation guide.
 https://mxnet.apache.org/versions/master/install/index.html
 
-.. GENERATED FROM PYTHON SOURCE LINES 38-45
+.. GENERATED FROM PYTHON SOURCE LINES 35-42
 
 .. code-block:: default
 
@@ -55,13 +56,13 @@ https://mxnet.apache.org/versions/master/install/index.html
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 51-54
+.. GENERATED FROM PYTHON SOURCE LINES 49-52
 
 Download Resnet18 model from Gluon Model Zoo
 ---------------------------------------------
 In this section, we download a pretrained ImageNet model and classify an image.
 
-.. GENERATED FROM PYTHON SOURCE LINES 54-91
+.. GENERATED FROM PYTHON SOURCE LINES 52-89
 
 .. code-block:: default
 
@@ -115,13 +116,13 @@ In this section, we download a pretrained imagenet model and classify an image.
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipe877dd51-e0f1-4796-a70d-2b2516e1b6f4 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip5b2e833d-6156-4871-b906-654707010ba8 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
     x (1, 3, 224, 224)
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 92-97
+.. GENERATED FROM PYTHON SOURCE LINES 90-95
 
 Compile the Graph
 -----------------
@@ -129,7 +130,7 @@ Now we would like to port the Gluon model to a portable computational graph.
 It's as easy as several lines.
 We support MXNet static graph (symbol) and HybridBlock in mxnet.gluon.
 
-.. GENERATED FROM PYTHON SOURCE LINES 97-103
+.. GENERATED FROM PYTHON SOURCE LINES 95-101
 
 .. code-block:: default
 
@@ -146,11 +147,11 @@ We support MXNet static graph(symbol) and HybridBlock in mxnet.gluon
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 104-105
+.. GENERATED FROM PYTHON SOURCE LINES 102-103
 
 now compile the graph
 
-.. GENERATED FROM PYTHON SOURCE LINES 105-109
+.. GENERATED FROM PYTHON SOURCE LINES 103-107
 
 .. code-block:: default
 
@@ -165,13 +166,13 @@ now compile the graph
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-113
+.. GENERATED FROM PYTHON SOURCE LINES 108-111
 
 Execute the portable graph on TVM
 ---------------------------------
 Now, we would like to reproduce the same forward computation using TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-127
+.. GENERATED FROM PYTHON SOURCE LINES 111-125
 
 .. code-block:: default
 
@@ -202,14 +203,14 @@ Now, we would like to reproduce the same forward computation using TVM.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 128-132
+.. GENERATED FROM PYTHON SOURCE LINES 126-130
 
 Use MXNet symbol with pretrained weights
 ----------------------------------------
 MXNet often uses `arg_params` and `aux_params` to store network parameters
 separately; here we show how to use these weights with the existing API.
 
-.. GENERATED FROM PYTHON SOURCE LINES 132-147
+.. GENERATED FROM PYTHON SOURCE LINES 130-145
 
 .. code-block:: default
 
@@ -235,11 +236,11 @@ separately, here we show how to use these weights with existing API
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 148-149
+.. GENERATED FROM PYTHON SOURCE LINES 146-147
 
 For a normal MXNet model, we start from here.
 
-.. GENERATED FROM PYTHON SOURCE LINES 149-153
+.. GENERATED FROM PYTHON SOURCE LINES 147-151
 
 .. code-block:: default
 
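The "several lines" mentioned above are essentially one frontend call; a minimal sketch, with the resnet18_v1 Gluon block assumed from this tutorial (the same entry point also accepts an MXNet symbol together with `arg_params`/`aux_params`):

.. code-block:: python

    from mxnet.gluon.model_zoo import vision
    from tvm import relay

    # A Gluon HybridBlock; a plain MXNet symbol works here too.
    block = vision.get_model("resnet18_v1", pretrained=True)

    # Map the graph input name to its NCHW shape.
    shape_dict = {"data": (1, 3, 224, 224)}
    mod, params = relay.frontend.from_mxnet(block, shape_dict)
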
diff --git a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
index ae187e476c..df8666ff17 100644
--- a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_oneflow.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_oneflow.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_oneflow.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/2e7b51cb39c472626dd3f046d9b89966/from_oneflow.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -30,15 +34,16 @@ A quick solution is to install via pip
 
 .. code-block:: bash
 
+    %%shell
     pip install flowvision==0.1.0
-    python3 -m pip install -f https://release.oneflow.info oneflow==0.7.0+cpu
+    pip install -f https://release.oneflow.info oneflow==0.7.0+cpu
 
 or please refer to official site:
 https://github.com/Oneflow-Inc/oneflow
 
 Currently, TVM supports OneFlow 0.7.0. Other versions may be unstable.
 
-.. GENERATED FROM PYTHON SOURCE LINES 38-53
+.. GENERATED FROM PYTHON SOURCE LINES 39-54
 
 .. code-block:: default
 
@@ -90,12 +95,12 @@ Currently, TVM supports OneFlow 0.7.0. Other versions may be unstable.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 59-61
+.. GENERATED FROM PYTHON SOURCE LINES 61-63
 
 Load a pretrained OneFlow model and save model
 ----------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 61-69
+.. GENERATED FROM PYTHON SOURCE LINES 63-71
 
 .. code-block:: default
 
@@ -116,18 +121,18 @@ Load a pretrained OneFlow model and save model
  .. code-block:: none
 
     Downloading: "https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip" to /workspace/.oneflow/flowvision_cache/resnet18.zip
-
      0%|          | 0.00/41.5M [00:00<?, ?B/s]
     15%|#5        | 6.33M/41.5M [00:00<00:00, 49.3MB/s]
     32%|###2      | 13.3M/41.5M [00:00<00:00, 61.8MB/s]
     47%|####6     | 19.4M/41.5M [00:00<00:00, 54.1MB/s]
     60%|#####9    | 24.7M/41.5M [00:00<00:00, 36.5MB/s]
     82%|########2 | 34.1M/41.5M [00:00<00:00, 50.1MB/s]
     96%|#########6| 40.0M/41.5M [00:00<00:00, 51.9MB/s]
    100%|##########| 41.5M/41.5M [00:00<00:00, 51.3MB/s]
+
      0%|          | 0.00/41.5M [00:00<?, ?B/s]
     23%|##2       | 9.38M/41.5M [00:00<00:00, 98.4MB/s]
     45%|####5     | 18.8M/41.5M [00:00<00:00, 87.4MB/s]
     66%|######5   | 27.2M/41.5M [00:00<00:00, 75.4MB/s]
     83%|########3 | 34.5M/41.5M [00:00<00:00, 75.5MB/s]
    100%|##########| 41.5M/41.5M [00:00<00:00, 68.7MB/s]
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 70-73
+.. GENERATED FROM PYTHON SOURCE LINES 72-75
 
 Load a test image
 -----------------
 Classic cat example!
 
-.. GENERATED FROM PYTHON SOURCE LINES 73-93
+.. GENERATED FROM PYTHON SOURCE LINES 75-95
 
 .. code-block:: default
 
@@ -158,13 +163,13 @@ Classic cat example!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 94-97
+.. GENERATED FROM PYTHON SOURCE LINES 96-99
 
 Import the graph to Relay
 -------------------------
 Convert OneFlow graph to Relay graph. The input name can be arbitrary.
 
-.. GENERATED FROM PYTHON SOURCE LINES 97-112
+.. GENERATED FROM PYTHON SOURCE LINES 99-114
 
 .. code-block:: default
 
@@ -190,13 +195,13 @@ Convert OneFlow graph to Relay graph. The input name can be arbitrary.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-116
+.. GENERATED FROM PYTHON SOURCE LINES 115-118
 
 Relay Build
 -----------
 Compile the graph to llvm target with given input specification.
 
-.. GENERATED FROM PYTHON SOURCE LINES 116-121
+.. GENERATED FROM PYTHON SOURCE LINES 118-123
 
 .. code-block:: default
 
@@ -212,13 +217,13 @@ Compile the graph to llvm target with given input specification.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-125
+.. GENERATED FROM PYTHON SOURCE LINES 124-127
 
 Execute the portable graph on TVM
 ---------------------------------
 Now we can try deploying the compiled model on target.
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-133
+.. GENERATED FROM PYTHON SOURCE LINES 127-135
 
 .. code-block:: default
 
@@ -244,13 +249,13 @@ Now we can try deploying the compiled model on target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 134-137
+.. GENERATED FROM PYTHON SOURCE LINES 136-139
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 137-184
+.. GENERATED FROM PYTHON SOURCE LINES 139-186
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/compile_models/from_onnx.rst.txt b/docs/_sources/how_to/compile_models/from_onnx.rst.txt
index efd22a828e..66458854a5 100644
--- a/docs/_sources/how_to/compile_models/from_onnx.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_onnx.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_onnx.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_onnx.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_onnx.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/779f52a44f2b8ab22dc21eee0c27fd4d/from_onnx.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -24,18 +28,17 @@ Compile ONNX Models
 
 This article is an introductory tutorial to deploy ONNX models with Relay.
 
-For us to begin with, ONNX package must be installed.
-
-A quick solution is to install protobuf compiler, and
+To begin, install the ONNX package:
 
 .. code-block:: bash
 
-    pip install --user onnx onnxoptimizer
+    %%shell
+    pip install onnx onnxoptimizer
 
-or please refer to official site.
+Alternatively, you can refer to the official site:
 https://github.com/onnx/onnx
 
-.. GENERATED FROM PYTHON SOURCE LINES 35-43
+.. GENERATED FROM PYTHON SOURCE LINES 34-42
 
 .. code-block:: default
 
@@ -54,7 +57,7 @@ https://github.com/onnx/onnx
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 49-54
+.. GENERATED FROM PYTHON SOURCE LINES 48-53
 
 Load pretrained ONNX model
 ---------------------------------------------
 The example super-resolution model used here is exactly the same model as in the ONNX tutorial
 http://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
 We skip the PyTorch model construction part and download the saved ONNX model.
 
-.. GENERATED FROM PYTHON SOURCE LINES 54-66
+.. GENERATED FROM PYTHON SOURCE LINES 53-65
 
 .. code-block:: default
 
@@ -85,7 +88,7 @@ we skip the pytorch model construction part, and download the saved onnx model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 67-74
+.. GENERATED FROM PYTHON SOURCE LINES 66-73
 
 Load a test image
 ---------------------------------------------
@@ -95,7 +98,7 @@ axis, a 672x672 image. Re-scale the cat image to fit this input shape then
 convert to `YCbCr`. The super resolution model will then be applied to the
 luminance (`Y`) channel.
 
-.. GENERATED FROM PYTHON SOURCE LINES 74-83
+.. GENERATED FROM PYTHON SOURCE LINES 73-82
 
 .. code-block:: default
 
@@ -115,7 +118,7 @@ luminance (`Y`) channel.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-94
+.. GENERATED FROM PYTHON SOURCE LINES 83-93
 
 Compile the model with relay
 ---------------------------------------------
@@ -128,7 +131,7 @@ Passing in the shape dictionary to the `relay.frontend.from_onnx` method
 tells relay which ONNX parameters are inputs, and which are parameters, and
 provides a static definition of the input size.
 
-.. GENERATED FROM PYTHON SOURCE LINES 94-105
+.. GENERATED FROM PYTHON SOURCE LINES 93-104
 
 .. code-block:: default
 
@@ -159,12 +162,12 @@ provides a static definition of the input size.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 106-108
+.. GENERATED FROM PYTHON SOURCE LINES 105-107
 
 Execute on TVM
 ---------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 108-111
+.. GENERATED FROM PYTHON SOURCE LINES 107-110
 
 .. code-block:: default
 
@@ -178,7 +181,7 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-117
+.. GENERATED FROM PYTHON SOURCE LINES 111-116
 
 Display results
 ---------------------------------------------
@@ -186,7 +189,7 @@ We put input and output image neck to neck. The luminance channel, `Y` is the ou
 from the model. The chroma channels `Cb` and `Cr` are resized to match with a simple
 bicubic algorithm. The image is then recombined and converted back to `RGB`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-129
+.. GENERATED FROM PYTHON SOURCE LINES 116-128
 
 .. code-block:: default
 
@@ -215,15 +218,15 @@ bicubic algorithm. The image is then recombined and converted back to `RGB`.
 
  .. code-block:: none
 
-    /workspace/gallery/how_to/compile_models/from_onnx.py:120: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
+    /workspace/gallery/how_to/compile_models/from_onnx.py:119: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
       out_cb = img_cb.resize(out_y.size, Image.BICUBIC)
-    /workspace/gallery/how_to/compile_models/from_onnx.py:121: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
+    /workspace/gallery/how_to/compile_models/from_onnx.py:120: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
       out_cr = img_cr.resize(out_y.size, Image.BICUBIC)
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 130-142
+.. GENERATED FROM PYTHON SOURCE LINES 129-141
 
 Notes
 ---------------------------------------------
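
To recap the shape-dictionary point made earlier in this file: the keys name the ONNX graph inputs, everything else is treated as a weight, and the shapes give Relay a static input size. A minimal sketch, with the input name "1" taken from this tutorial's super-resolution model:

.. code-block:: python

    import onnx
    from tvm import relay

    onnx_model = onnx.load("super_resolution.onnx")

    # "1" is the graph input of the tutorial's model; other models use
    # their own input names.
    shape_dict = {"1": (1, 1, 224, 224)}
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
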
diff --git a/docs/_sources/how_to/compile_models/from_paddle.rst.txt b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
index 859616ee87..283c1b38b9 100644
--- a/docs/_sources/how_to/compile_models/from_paddle.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_paddle.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_paddle.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_paddle.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/a608d8b69371e9bc149dd89f6db2c38e/from_paddle.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -23,14 +27,14 @@ Compile PaddlePaddle Models
 **Author**: `Ziyuan Ma <https://github.com/ZiyuanMa/>`_
 
 This article is an introductory tutorial to deploy PaddlePaddle models with Relay.
-For us to begin with, PaddlePaddle>=2.1.3 is required to be installed.
-A quick solution is
+To begin, we'll install PaddlePaddle>=2.1.3:
 
 .. code-block:: bash
 
+    %%shell
     pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
 
-or please refer to official site.
+For more details, refer to the official install instructions at:
 https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html
 
 .. GENERATED FROM PYTHON SOURCE LINES 33-41
diff --git a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
index 8f29de3bde..eee4883552 100644
--- a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_pytorch.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_pytorch.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_pytorch.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/1f4943aed1aa607b2775c18b1d71db10/from_pytorch.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -24,15 +28,15 @@ Compile PyTorch Models
 
 This article is an introductory tutorial to deploy PyTorch models with Relay.
 
-For us to begin with, PyTorch should be installed.
-TorchVision is also required since we will be using it as our model zoo.
-
-A quick solution is to install via pip
+To begin, PyTorch should be installed.
+TorchVision is also required so we can use the model zoo.
+A quick solution is to install via pip:
 
 .. code-block:: bash
 
-    pip install torch==1.7.0
-    pip install torchvision==0.8.1
+    %%shell
+    pip install torch
+    pip install torchvision
 
 or please refer to official site
 https://pytorch.org/get-started/locally/
@@ -98,7 +102,7 @@ Load a pretrained PyTorch model
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
       warnings.warn(msg)
     Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
-
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     32%|###1      | 14.1M/44.7M [00:00<00:00, 148MB/s]
     63%|######3   | 28.2M/44.7M [00:00<00:00, 111MB/s]
     88%|########8 | 39.4M/44.7M [00:00<00:00, 110MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 111MB/s]
+
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     28%|##7       | 12.3M/44.7M [00:00<00:00, 129MB/s]
     55%|#####5    | 24.6M/44.7M [00:00<00:00, 109MB/s]
     79%|#######8  | 35.2M/44.7M [00:00<00:00, 106MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 106MB/s]
 
 
 
diff --git a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
index 1f414cbd55..55c9576734 100644
--- a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_tensorflow.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_tensorflow.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_tensorflow.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/83e3b018e8bac8d31bb331d200a33a04/from_tensorflow.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -24,9 +28,14 @@ This article is an introductory tutorial to deploy tensorflow models with TVM.
 
 To begin, the TensorFlow Python module must be installed.
 
+.. code-block:: bash
+
+    %%shell
+    pip install tensorflow
+
 Please refer to https://www.tensorflow.org/install
 
-.. GENERATED FROM PYTHON SOURCE LINES 26-70
+.. GENERATED FROM PYTHON SOURCE LINES 31-75
 
 .. code-block:: default
 
@@ -81,14 +90,14 @@ Please refer to https://www.tensorflow.org/install
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 76-80
+.. GENERATED FROM PYTHON SOURCE LINES 81-85
 
 Tutorials
 ---------
 Please refer to docs/frontend/tensorflow.md for more details on various models
 from TensorFlow.
 
-.. GENERATED FROM PYTHON SOURCE LINES 80-101
+.. GENERATED FROM PYTHON SOURCE LINES 85-106
 
 .. code-block:: default
 
@@ -120,13 +129,13 @@ from tensorflow.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 102-105
+.. GENERATED FROM PYTHON SOURCE LINES 107-110
 
 Download required files
 -----------------------
 Download files listed above.
 
-.. GENERATED FROM PYTHON SOURCE LINES 105-112
+.. GENERATED FROM PYTHON SOURCE LINES 110-117
 
 .. code-block:: default
 
@@ -144,13 +153,13 @@ Download files listed above.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-116
+.. GENERATED FROM PYTHON SOURCE LINES 118-121
 
 Import model
 ------------
 Creates tensorflow graph definition from protobuf file.
 
-.. GENERATED FROM PYTHON SOURCE LINES 116-127
+.. GENERATED FROM PYTHON SOURCE LINES 121-132
 
 .. code-block:: default
 
@@ -172,7 +181,7 @@ Creates tensorflow graph definition from protobuf file.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 128-136
+.. GENERATED FROM PYTHON SOURCE LINES 133-141
 
 Decode image
 ------------
@@ -183,7 +192,7 @@ Decode image
   Hence we supply decoded frame to TVM instead.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-143
+.. GENERATED FROM PYTHON SOURCE LINES 141-148
 
 .. code-block:: default
 
@@ -201,7 +210,7 @@ Decode image
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 144-151
+.. GENERATED FROM PYTHON SOURCE LINES 149-156
 
 Import the graph to Relay
 -------------------------
@@ -211,7 +220,7 @@ Results:
   sym: relay expr for given tensorflow protobuf.
   params: params converted from tensorflow params (tensor protobuf).
 
-.. GENERATED FROM PYTHON SOURCE LINES 151-156
+.. GENERATED FROM PYTHON SOURCE LINES 156-161
 
 .. code-block:: default
 
@@ -237,7 +246,7 @@ Results:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-165
+.. GENERATED FROM PYTHON SOURCE LINES 162-170
 
 Relay Build
 -----------
@@ -248,7 +257,7 @@ Results:
   params: final params after compilation.
   lib: target library which can be deployed on target with TVM runtime.
 
-.. GENERATED FROM PYTHON SOURCE LINES 165-169
+.. GENERATED FROM PYTHON SOURCE LINES 170-174
 
 .. code-block:: default
 
@@ -263,13 +272,13 @@ Results:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 170-173
+.. GENERATED FROM PYTHON SOURCE LINES 175-178
 
 Execute the portable graph on TVM
 ---------------------------------
 Now we can try deploying the compiled model on target.
 
-.. GENERATED FROM PYTHON SOURCE LINES 173-185
+.. GENERATED FROM PYTHON SOURCE LINES 178-190
 
 .. code-block:: default
 
@@ -292,13 +301,13 @@ Now we can try deploying the compiled model on target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 186-189
+.. GENERATED FROM PYTHON SOURCE LINES 191-194
 
 Process the output
 ------------------
 Process the model output to human readable text for InceptionV1.
 
-.. GENERATED FROM PYTHON SOURCE LINES 189-202
+.. GENERATED FROM PYTHON SOURCE LINES 194-207
 
 .. code-block:: default
 
@@ -332,13 +341,13 @@ Process the model output to human readable text for InceptionV1.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 203-206
+.. GENERATED FROM PYTHON SOURCE LINES 208-211
 
 Inference on tensorflow
 -----------------------
 Run the corresponding model on tensorflow
 
-.. GENERATED FROM PYTHON SOURCE LINES 206-259
+.. GENERATED FROM PYTHON SOURCE LINES 211-264
 
 .. code-block:: default
 
@@ -416,7 +425,7 @@ Run the corresponding model on tensorflow
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  12.404 seconds)
+   **Total running time of the script:** ( 1 minutes  10.111 seconds)
 
 
 .. _sphx_glr_download_how_to_compile_models_from_tensorflow.py:
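
The import step above ("sym: relay expr for given tensorflow protobuf") boils down to parsing the frozen GraphDef and calling the TensorFlow frontend; a hedged sketch, with the protobuf file name and the `DecodeJpeg/contents` input name assumed from the InceptionV1 tutorial:

.. code-block:: python

    import tensorflow as tf
    from tvm import relay

    # Parse the frozen InceptionV1 protobuf downloaded by the tutorial.
    with tf.io.gfile.GFile("classify_image_graph_def.pb", "rb") as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    # `shape` pins the graph input to a static shape; `params` holds the
    # converted TensorFlow weights.
    mod, params = relay.frontend.from_tensorflow(
        graph_def, shape={"DecodeJpeg/contents": (299, 299, 3)}
    )
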
diff --git a/docs/_sources/how_to/compile_models/from_tflite.rst.txt b/docs/_sources/how_to/compile_models/from_tflite.rst.txt
index e5fe00eb5f..a2706150cf 100644
--- a/docs/_sources/how_to/compile_models/from_tflite.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tflite.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/compile_models/from_tflite.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_compile_models_from_tflite.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_compile_models_from_tflite.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
 To get started, the TFLite package needs to be installed as a prerequisite.
 
 .. code-block:: bash
 
-    # install tflite
-    pip install tflite==2.1.0 --user
-
+    %%shell
+    pip install tflite==2.1.0
 
 or you could generate TFLite package yourself. The steps are the following:
 
@@ -55,12 +58,12 @@ Now please check if TFLite package is installed successfully, ``python -c "impor
 
 Below you can find an example on how to compile TFLite model using TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 57-59
+.. GENERATED FROM PYTHON SOURCE LINES 56-58
 
 Utils for downloading and extracting zip files
 ----------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 59-76
+.. GENERATED FROM PYTHON SOURCE LINES 58-75
 
 .. code-block:: default
 
@@ -88,13 +91,13 @@ Utils for downloading and extracting zip files
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 82-85
+.. GENERATED FROM PYTHON SOURCE LINES 81-84
 
 Load pretrained TFLite model
 ----------------------------
 Load mobilenet V1 TFLite model provided by Google
 
-.. GENERATED FROM PYTHON SOURCE LINES 85-108
+.. GENERATED FROM PYTHON SOURCE LINES 84-107
 
 .. code-block:: default
 
@@ -128,13 +131,13 @@ Load mobilenet V1 TFLite model provided by Google
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 109-112
+.. GENERATED FROM PYTHON SOURCE LINES 108-111
 
 Load a test image
 -----------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-133
+.. GENERATED FROM PYTHON SOURCE LINES 111-132
 
 .. code-block:: default
 
@@ -177,12 +180,12 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 134-136
+.. GENERATED FROM PYTHON SOURCE LINES 133-135
 
 Compile the model with relay
 ----------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-154
+.. GENERATED FROM PYTHON SOURCE LINES 135-153
 
 .. code-block:: default
 
@@ -211,12 +214,12 @@ Compile the model with relay
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 155-157
+.. GENERATED FROM PYTHON SOURCE LINES 154-156
 
 Execute on TVM
 --------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-173
+.. GENERATED FROM PYTHON SOURCE LINES 156-172
 
 .. code-block:: default
 
@@ -243,12 +246,12 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 174-176
+.. GENERATED FROM PYTHON SOURCE LINES 173-175
 
 Display results
 ---------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 176-201
+.. GENERATED FROM PYTHON SOURCE LINES 175-200
 
 .. code-block:: default
 
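The "Compile the model with relay" step takes the parsed flatbuffer plus an input name, shape, and dtype; a minimal sketch, with the mobilenet file and the input name `input` assumed from this tutorial:

.. code-block:: python

    import tflite
    from tvm import relay

    # Parse the TFLite flatbuffer downloaded earlier.
    with open("mobilenet_v1_1.0_224.tflite", "rb") as f:
        tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

    # TFLite import is keyed by input name for both shapes and dtypes.
    mod, params = relay.frontend.from_tflite(
        tflite_model,
        shape_dict={"input": (1, 224, 224, 3)},
        dtype_dict={"input": "float32"},
    )
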
diff --git a/docs/_sources/how_to/compile_models/index.rst.txt b/docs/_sources/how_to/compile_models/index.rst.txt
index c6ddcffcf4..a6a74f929a 100644
--- a/docs/_sources/how_to/compile_models/index.rst.txt
+++ b/docs/_sources/how_to/compile_models/index.rst.txt
@@ -51,7 +51,7 @@ formats. These how-tos demonstrate how to import models using the Python API.
 
 .. raw:: html
 
-    <div class="sphx-glr-thumbcontainer" tooltip="This article is an introductory tutorial to deploy mxnet models with Relay.">
+    <div class="sphx-glr-thumbcontainer" tooltip="This article is an introductory tutorial to deploy mxnet models with Relay. To begin, we must i...">
 
 .. only:: html
 
@@ -85,7 +85,7 @@ formats. These how-tos demonstrate how to import models using the Python API.
 
 .. raw:: html
 
-    <div class="sphx-glr-thumbcontainer" tooltip="This article is an introductory tutorial to deploy keras models with Relay.">
+    <div class="sphx-glr-thumbcontainer" tooltip="This article is an introductory tutorial to deploy Keras models with Relay.">
 
 .. only:: html
 
@@ -153,7 +153,7 @@ formats. These how-tos demonstrate how to import models using the Python API.
 
 .. raw:: html
 
-    <div class="sphx-glr-thumbcontainer" tooltip="This article is an introductory tutorial to deploy PaddlePaddle models with Relay. For us to be...">
+    <div class="sphx-glr-thumbcontainer" tooltip="This article is an introductory tutorial to deploy PaddlePaddle models with Relay. To begin, we...">
 
 .. only:: html
 
diff --git a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
index 9a6f263be2..3cd4e3c06b 100644
--- a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
@@ -5,26 +5,26 @@
 
 Computation times
 =================
-**05:41.684** total execution time for **how_to_compile_models** files:
+**05:33.870** total execution time for **how_to_compile_models** files:
 
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 01:12.404 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 01:10.111 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 01:08.808 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 01:06.880 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:46.594 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:45.810 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:32.224 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:31.605 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:28.855 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:28.422 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:26.197 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:25.666 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:25.292 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:24.498 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:22.563 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:22.170 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:16.353 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:16.280 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.395 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.426 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_adreno.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_adreno.rst.txt
index 286b9558fc..5f020b1cea 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_adreno.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_adreno.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_model_on_adreno.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_adreno.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_adreno.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/b9e7311d8c56eb6e6aca08f0be35ff03/deploy_model_on_adreno.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -33,6 +37,7 @@ A quick solution is to install it via pip:
 
 .. code-block:: bash
 
+  %%shell
   pip install torch
   pip install torchvision
 
@@ -44,7 +49,7 @@ See the following instructions on how to build it.
 After the build section there should be two files in *build* directory «libtvm_runtime.so» and «tvm_rpc».
 Let's push them to the device and run TVM RPC Server.
 
-.. GENERATED FROM PYTHON SOURCE LINES 47-116
+.. GENERATED FROM PYTHON SOURCE LINES 48-117
 
 TVM RPC Server
 --------------
@@ -116,13 +121,13 @@ the output can be:
    android      1      1     0
    ----------------------------------
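
 As a minimal sketch, assuming the tracker listens on 127.0.0.1:9190 and the
 device registered itself with the key ``android``, the same queue status can
 be queried from Python:

 .. code-block:: python

     from tvm import rpc

     # Connect to the RPC tracker and print its queue; host, port and the
     # "android" key are assumptions matching the setup above.
     tracker = rpc.connect_tracker("127.0.0.1", 9190)
     print(tracker.text_summary())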
 
-.. GENERATED FROM PYTHON SOURCE LINES 118-121
+.. GENERATED FROM PYTHON SOURCE LINES 119-122
 
 Load a test image
 -----------------
 As an example we will use the classic cat image from ImageNet
 
-.. GENERATED FROM PYTHON SOURCE LINES 121-148
+.. GENERATED FROM PYTHON SOURCE LINES 122-149
 
 .. code-block:: default
 
@@ -165,13 +170,13 @@ As an example we would use classical cat image from ImageNet
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 154-157
+.. GENERATED FROM PYTHON SOURCE LINES 155-158
 
 Load pretrained Pytorch model
 -----------------------------
 Create a Relay graph from a Pytorch ResNet-18 model
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-180
+.. GENERATED FROM PYTHON SOURCE LINES 158-181
 
 .. code-block:: default
 
@@ -218,13 +223,13 @@ Create a Relay graph from a Pytorch ResNet-18 model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 181-184
+.. GENERATED FROM PYTHON SOURCE LINES 182-185
 
 Precisions
 ----------
 Since TVM supports Mixed Precision, we need to register mixed_precision_conversion:
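
 A minimal sketch of such a registration, mirroring the "float16_acc32"
 configuration used below (accumulate in float32):

 .. code-block:: python

     from tvm.relay.op import register_mixed_precision_conversion
     from tvm.relay.transform.mixed_precision import MIXED_PRECISION_ALWAYS

     # Always convert nn.conv2d, accumulating in float32 and producing the
     # requested mixed-precision dtype; level=11 overrides the default rule.
     @register_mixed_precision_conversion("nn.conv2d", level=11)
     def conv2d_mixed_precision_rule(call_node, mixed_precision_type):
         return [MIXED_PRECISION_ALWAYS, "float32", mixed_precision_type]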
 
-.. GENERATED FROM PYTHON SOURCE LINES 184-209
+.. GENERATED FROM PYTHON SOURCE LINES 185-210
 
 .. code-block:: default
 
@@ -260,11 +265,11 @@ Since TVM support Mixed Precision, we need to register mixed_precision_conversio
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 210-211
+.. GENERATED FROM PYTHON SOURCE LINES 211-212
 
 and also define the conversion function itself
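
 A sketch of what this function can look like, assuming it only has to
 dispatch on the requested dtype (names and the opt_level are illustrative,
 not the tutorial's exact code):

 .. code-block:: python

     import tvm
     from tvm import relay

     def convert_to_dtype(mod, dtype):
         # Run the ToMixedPrecision pass for the float16-based modes.
         if dtype in ("float16", "float16_acc32"):
             with tvm.transform.PassContext(opt_level=3):
                 mod = relay.transform.ToMixedPrecision("float16")(mod)
         return mod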
 
-.. GENERATED FROM PYTHON SOURCE LINES 211-227
+.. GENERATED FROM PYTHON SOURCE LINES 212-228
 
 .. code-block:: default
 
@@ -291,11 +296,11 @@ and also define the conversion function itself
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 228-229
+.. GENERATED FROM PYTHON SOURCE LINES 229-230
 
 Let's choose "float16_acc32", for example.
 
-.. GENERATED FROM PYTHON SOURCE LINES 229-235
+.. GENERATED FROM PYTHON SOURCE LINES 230-236
 
 .. code-block:: default
 
@@ -537,13 +542,13 @@ Let's choose "float16_acc32" for example.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 236-239
+.. GENERATED FROM PYTHON SOURCE LINES 237-240
 
 As you can see in the IR, the architecture now contains cast operations, which are
 needed to convert to FP16 precision.
 You can also use "float16" or "float32" precisions as other dtype options.
 
-.. GENERATED FROM PYTHON SOURCE LINES 241-249
+.. GENERATED FROM PYTHON SOURCE LINES 242-250
 
 Compile the model with relay
 ----------------------------
@@ -554,7 +559,7 @@ If running it on the Android device, we need to
 specify its instruction set. Set :code:`local_demo` to False if you want
 to run this tutorial with a real device.
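
 A sketch of the target setup, assuming the usual OpenCL device target with
 an ARM64 Android host (``mod`` and ``params`` come from the steps above):

 .. code-block:: python

     import tvm
     from tvm import relay

     local_demo = True  # set to False to cross-compile for the device
     if local_demo:
         target = tvm.target.Target("llvm")
     else:
         # OpenCL kernels for the Adreno GPU, host code for ARM64 Android.
         target = tvm.target.Target("opencl", host="llvm -mtriple=arm64-linux-android")

     with tvm.transform.PassContext(opt_level=3):
         lib = relay.build(mod, target=target, params=params)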
 
-.. GENERATED FROM PYTHON SOURCE LINES 249-271
+.. GENERATED FROM PYTHON SOURCE LINES 250-272
 
 .. code-block:: default
 
@@ -587,14 +592,14 @@ to run this tutorial with a real device.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 272-276
+.. GENERATED FROM PYTHON SOURCE LINES 273-277
 
 Deploy the Model Remotely by RPC
 --------------------------------
 Using RPC you can deploy the model from the host
 machine to the remote Adreno device.
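
 A sketch of the RPC flow, assuming the tracker address and ``android`` key
 from the setup above (a real Android build would export the library with the
 NDK compiler):

 .. code-block:: python

     from tvm import rpc
     from tvm.contrib import utils

     tracker = rpc.connect_tracker("127.0.0.1", 9190)
     remote = tracker.request("android", priority=0, session_timeout=60)

     # Export the compiled library, push it to the device, and load it back.
     temp = utils.tempdir()
     lib.export_library(temp.relpath("net.so"))
     remote.upload(temp.relpath("net.so"))
     rlib = remote.load_module("net.so")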
 
-.. GENERATED FROM PYTHON SOURCE LINES 276-307
+.. GENERATED FROM PYTHON SOURCE LINES 277-308
 
 .. code-block:: default
 
@@ -636,13 +641,13 @@ machine to the remote Adreno device
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 308-311
+.. GENERATED FROM PYTHON SOURCE LINES 309-312
 
 Run inference
 -------------
 We can now set inputs, run inference on our model, and get predictions as output.
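
 A minimal sketch, assuming ``rlib`` and ``remote`` from the RPC step, the
 preprocessed image ``img`` from earlier, and ``data`` as the input name:

 .. code-block:: python

     import tvm
     from tvm.contrib import graph_executor

     dev = remote.cl(0)  # OpenCL device on the remote phone
     m = graph_executor.GraphModule(rlib["default"](dev))
     m.set_input("data", tvm.nd.array(img))
     m.run()
     tvm_output = m.get_output(0)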
 
-.. GENERATED FROM PYTHON SOURCE LINES 311-315
+.. GENERATED FROM PYTHON SOURCE LINES 312-316
 
 .. code-block:: default
 
@@ -657,14 +662,14 @@ We now can set inputs, infer our model and get predictions as output
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 316-320
+.. GENERATED FROM PYTHON SOURCE LINES 317-321
 
 Get predictions and performance statistics
 ------------------------------------------
 This piece of code displays the top-1 and top-5 predictions, as
 well as providing information about the model's performance.
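
 Both steps follow the standard pattern; a sketch, with the ``number`` and
 ``repeat`` values as assumptions:

 .. code-block:: python

     import numpy as np

     # Top-5 class ids from the output tensor of the run above.
     top5 = np.argsort(tvm_output.numpy()[0])[-5:][::-1]
     print("Top-5 class ids:", top5)

     # Timing summary (mean/median/max/min/std) like the table below.
     print(m.benchmark(dev, number=1, repeat=10))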
 
-.. GENERATED FROM PYTHON SOURCE LINES 320-352
+.. GENERATED FROM PYTHON SOURCE LINES 321-353
 
 .. code-block:: default
 
@@ -723,7 +728,7 @@ well as provides information about the model's performance
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-     3340.1604    3340.2200    3343.9844    3336.3355      2.2440   
+     3340.8585    3339.3044    3354.8016    3335.9312      5.3039   
                
 
 
@@ -732,7 +737,7 @@ well as provides information about the model's performance
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  1.004 seconds)
+   **Total running time of the script:** ( 1 minutes  0.638 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_model_on_adreno.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
index f8635e13aa..e525bbf6c7 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_model_on_android.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_android.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_android.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/eed2658f15243bab719b2de7769fa45a/deploy_model_on_android.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -433,7 +437,7 @@ Execute on TVM
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      15.7544      15.7561      15.8753      15.6008       0.0897   
+      15.8475      15.6642      16.5656      15.4942       0.3540   
                
 
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_nano.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_nano.rst.txt
index 501f4b3aa4..be70c7322c 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_nano.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_nano.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_model_on_nano.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_nano.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_nano.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/cafefaac0e14b00fd7644da616cab35a/deploy_model_on_nano.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -47,7 +51,7 @@ it on Jetson Nano.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-84
+.. GENERATED FROM PYTHON SOURCE LINES 43-85
 
 .. _build-tvm-runtime-on-jetson-nano:
 
@@ -92,7 +96,7 @@ directory is in :code:`~/tvm`):
 
 To update the environment variables, execute :code:`source ~/.bashrc`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 86-102
+.. GENERATED FROM PYTHON SOURCE LINES 87-103
 
 Set Up RPC Server on Device
 ---------------------------
@@ -111,7 +115,7 @@ successfully on your device.
      INFO:RPCServer:bind to 0.0.0.0:9091
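
 This log comes from a standalone RPC server on the device; a sketch of
 starting one from Python (the port is an assumption matching the log):

 .. code-block:: python

     from tvm import rpc

     # Equivalent to running `python -m tvm.exec.rpc_server` on the device.
     server = rpc.Server(host="0.0.0.0", port=9091)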
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 104-111
+.. GENERATED FROM PYTHON SOURCE LINES 105-112
 
 Prepare the Pre-trained Model
 -----------------------------
 We will use a pre-trained model from
 `MXNet Gluon model zoo <https://mxnet.apache.org/api/python/gluon/model_zoo.html>`_.
 You can find more details about this part in the tutorial :ref:`tutorial-from-mxnet`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-119
+.. GENERATED FROM PYTHON SOURCE LINES 112-120
 
 .. code-block:: default
 
@@ -140,12 +144,12 @@ You can found more details about this part at tutorial :ref:`tutorial-from-mxnet
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 120-122
+.. GENERATED FROM PYTHON SOURCE LINES 121-123
 
 In order to test our model, we download an image of a cat and
 transform its format.
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-138
+.. GENERATED FROM PYTHON SOURCE LINES 123-139
 
 .. code-block:: default
 
@@ -172,12 +176,12 @@ transform its format.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 139-141
+.. GENERATED FROM PYTHON SOURCE LINES 140-142
 
 The synset is used to transform the label from an ImageNet class number to
 a word humans can understand.
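
 A sketch of that lookup; the gist below is the class-id-to-text mapping used
 across the TVM tutorials:

 .. code-block:: python

     from tvm.contrib.download import download_testdata

     synset_url = "".join([
         "https://gist.githubusercontent.com/zhreshold/",
         "4d0b62f3d01426887599d4f7ede23ee5/raw/",
         "596b27d23537e5a1b5751d2b0481ef172f58b539/",
         "imagenet1000_clsid_to_human.txt",
     ])
     synset_path = download_testdata(synset_url, "imagenet1000_clsid_to_human.txt",
                                     module="data")
     with open(synset_path) as f:
         synset = eval(f.read())  # {class_id: human-readable label}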
 
-.. GENERATED FROM PYTHON SOURCE LINES 141-154
+.. GENERATED FROM PYTHON SOURCE LINES 142-155
 
 .. code-block:: default
 
@@ -201,12 +205,12 @@ the word human can understand.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 155-157
+.. GENERATED FROM PYTHON SOURCE LINES 156-158
 
 Now we would like to port the Gluon model to a portable computational graph.
 It takes just a few lines, as sketched below.
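
 A sketch of the conversion; the ``data`` input name and the 1x3x224x224
 shape are the usual ImageNet assumptions, and ``block`` is the Gluon model
 loaded above:

 .. code-block:: python

     from tvm import relay

     shape_dict = {"data": (1, 3, 224, 224)}
     mod, params = relay.frontend.from_mxnet(block, shape_dict)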
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-165
+.. GENERATED FROM PYTHON SOURCE LINES 158-166
 
 .. code-block:: default
 
@@ -225,11 +229,11 @@ It's as easy as several lines.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 166-167
+.. GENERATED FROM PYTHON SOURCE LINES 167-168
 
 Here are some basic data workload configurations.
 
-.. GENERATED FROM PYTHON SOURCE LINES 167-172
+.. GENERATED FROM PYTHON SOURCE LINES 168-173
 
 .. code-block:: default
 
@@ -245,7 +249,7 @@ Here are some basic data workload configurations.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 173-182
+.. GENERATED FROM PYTHON SOURCE LINES 174-183
 
 Compile The Graph
 -----------------
@@ -257,14 +261,14 @@ apart from arguments :code:`net` and :code:`params` to specify the
 deep learning workload. This option matters: different options
 can lead to very different performance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 184-188
+.. GENERATED FROM PYTHON SOURCE LINES 185-189
 
 If we run the example on our x86 server for demonstration, we can simply
 set it as :code:`llvm`. If running it on the Jetson Nano, we need to
 set it as :code:`nvidia/jetson-nano`. Set :code:`local_demo` to False
 if you want to run this tutorial with a real device.
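
 A sketch of that choice:

 .. code-block:: python

     import tvm
     from tvm import relay

     local_demo = True
     if local_demo:
         target = tvm.target.Target("llvm")
     else:
         target = tvm.target.Target("nvidia/jetson-nano")  # pre-defined CUDA tag

     with tvm.transform.PassContext(opt_level=3):
         lib = relay.build(mod, target=target, params=params)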
 
-.. GENERATED FROM PYTHON SOURCE LINES 188-214
+.. GENERATED FROM PYTHON SOURCE LINES 189-215
 
 .. code-block:: default
 
@@ -308,14 +312,14 @@ if you want to run this tutorial with a real device.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 215-219
+.. GENERATED FROM PYTHON SOURCE LINES 216-220
 
 Deploy the Model Remotely by RPC
 --------------------------------
 With RPC, you can deploy the model remotely from your host machine
 to the remote device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 219-249
+.. GENERATED FROM PYTHON SOURCE LINES 220-250
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt
index 60ba16cd99..a07d6570a6 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_model_on_rasp.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_rasp.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_model_on_rasp.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/7c392f39b90d93406ef30c6185c5686c/deploy_model_on_rasp.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
index cad844148e..bdc72328f3 100644
--- a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_object_detection_pytorch.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/399e1d7889ca66b69d51655784827503/deploy_object_detection_pytorch.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -30,8 +34,8 @@ A quick solution is to install via pip
 
 .. code-block:: bash
 
-    pip install torch==1.7.0
-    pip install torchvision==0.8.1
+    pip install torch
+    pip install torchvision
 
 or please refer to official site
 https://pytorch.org/get-started/locally/
@@ -127,7 +131,7 @@ Load pre-trained maskrcnn from torchvision and do tracing
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=MaskRCNN_ResNet50_FPN_Weights.COCO_V1`. You can also use `weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT` to get the most up-to-date weights.
       warnings.warn(msg)
     Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
-
      0%|          | 0.00/170M [00:00<?, ?B/s] ...  100%|##########| 170M/170M [00:01<00:00, 91.5MB/s]
+
      0%|          | 0.00/170M [00:00<?, ?B/s] ...  100%|##########| 170M/170M [00:01<00:00, 101MB/s]
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/torch/nn/functional.py:3897: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
       for i in range(dim)
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/torchvision/models/detection/anchor_utils.py:124: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -296,7 +300,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 3 minutes  13.346 seconds)
+   **Total running time of the script:** ( 3 minutes  7.575 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
index 43660bbb67..add8bc8432 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_prequantized.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_prequantized.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_prequantized.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -236,7 +240,7 @@ training. Other models require a full post training calibration.
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=MobileNet_V2_Weights.IMAGENET1K_V1`. You can also use `weights=MobileNet_V2_Weights.DEFAULT` to get the most up-to-date weights.
       warnings.warn(msg)
     Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
-
      0%|          | 0.00/13.6M [00:00<?, ?B/s] ...  100%|##########| 13.6M/13.6M [00:00<00:00, 111MB/s]
+
      0%|          | 0.00/13.6M [00:00<?, ?B/s] ...  100%|##########| 13.6M/13.6M [00:00<00:00, 149MB/s]
 
 
 
@@ -418,7 +422,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      90.5628      90.5638      91.8590      90.1024       0.2746   
+      90.0417      89.9371      93.6422      89.7781       0.4685   
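
 The loop behind this table is the usual benchmark call; a sketch, where
 ``rt_mod`` is the compiled graph module and ``number``/``repeat`` are
 assumptions:

 .. code-block:: python

     import tvm

     dev = tvm.cpu(0)
     print(rt_mod.benchmark(dev, number=1, repeat=100))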
                
 
 
@@ -467,7 +471,7 @@ TODO
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  5.930 seconds)
+   **Total running time of the script:** ( 1 minutes  4.745 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
index 828e22a851..43f4de2a89 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_prequantized_tflite.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -432,7 +436,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      120.8426     120.7696     125.6792     119.9985      0.5769   
+      118.5336     118.5244     124.8535     116.4806      1.0182   
                
 
 
@@ -469,7 +473,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  27.419 seconds)
+   **Total running time of the script:** ( 2 minutes  34.865 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
index 3de3fb3852..8ef23d1187 100644
--- a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_quantized.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_quantized.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_quantized.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/a269cb38341b190be980a0bd3ea8a625/deploy_quantized.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -253,7 +257,7 @@ We create a Relay VM to build and execute the model.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  26.967 seconds)
+   **Total running time of the script:** ( 1 minutes  28.151 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_quantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt b/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt
index 15c8af0e03..ce5574fa68 100644
--- a/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_sparse.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_sparse.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_sparse.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
index 5617557ca9..853b1a6679 100644
--- a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/deploy_models/deploy_ssd_gluoncv.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/d92aacfae35477bed0f7f60aa8d2714e/deploy_ssd_gluoncv.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -166,7 +170,7 @@ Convert and compile model for CPU.
             data: None
       input_sym_arg_type = in_param.infer_type()[0]
     Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
-
      0%|          | 0/132723 [00:00<?, ?KB/s] ...  100%|##########| 132723/132723 [00:01<00:00, 78554.66KB/s]
+
      0%|          | 0/132723 [00:00<?, ?KB/s] ...  100%|##########| 132723/132723 [00:01<00:00, 75605.07KB/s]
 
 
 
@@ -242,7 +246,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 3 minutes  5.577 seconds)
+   **Total running time of the script:** ( 3 minutes  3.701 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
index d0ce875d0c..4ae6734fa3 100644
--- a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
@@ -5,26 +5,26 @@
 
 Computation times
 =================
-**13:46.319** total execution time for **how_to_deploy_models** files:
+**13:44.485** total execution time for **how_to_deploy_models** files:
 
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 03:13.346 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 03:07.575 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 03:05.577 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 03:03.701 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 02:27.419 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 02:34.865 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:26.967 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:28.151 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:05.930 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:04.745 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_adreno.py` (``deploy_model_on_adreno.py``)                   | 01:01.004 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_adreno.py` (``deploy_model_on_adreno.py``)                   | 01:00.638 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:35.284 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:34.717 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_nano.py` (``deploy_model_on_nano.py``)                       | 00:25.601 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_nano.py` (``deploy_model_on_nano.py``)                       | 00:25.260 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:25.185 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:24.827 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)                                     | 00:00.007 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
index e1f2d0afe1..8a86004a43 100644
--- a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/extend_tvm/bring_your_own_datatypes.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_extend_tvm_bring_your_own_datatypes.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_extend_tvm_bring_your_own_datatypes.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -50,7 +54,7 @@ Since we do not use any 3rdparty library, there is no setup needed.
 
 If you would like to try this with your own datatype library, first bring the library's functions into the process space with ``CDLL``:
 
-.. code-block :: python
+.. code-block:: python
 
     ctypes.CDLL('my-datatype-lib.so', ctypes.RTLD_GLOBAL)
 
@@ -472,7 +476,7 @@ First let us define two helper functions to get the mobilenet model and a cat im
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zipfe35e9ed-d04b-4efd-9a1c-a5b852afb902 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zipa0a8e594-1da9-4714-9833-f7e4fd889ea3 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 
 
 
diff --git a/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt b/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt
index 28257ca020..560efb4b38 100644
--- a/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/extend_tvm/low_level_custom_pass.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_extend_tvm_low_level_custom_pass.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_extend_tvm_low_level_custom_pass.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/d58ec306b89044968adefb49e6552378/low_level_custom_pass.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
index 8793e6c30a..f5dfeac1ef 100644
--- a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**00:47.420** total execution time for **how_to_extend_tvm** files:
+**00:46.260** total execution time for **how_to_extend_tvm** files:
 
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:43.973 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:42.925 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.421 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.327 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:01.018 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:01.000 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)       | 00:00.007 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt
index ff27b305d3..c8b4a9ee87 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/extend_tvm/use_pass_infra.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_extend_tvm_use_pass_infra.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_extend_tvm_use_pass_infra.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/7ef14586a3b62fe120d97d5fedf72879/use_pass_infra.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
index c1d900d981..9e5808049c 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/extend_tvm/use_pass_instrument.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_extend_tvm_use_pass_instrument.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_extend_tvm_use_pass_instrument.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/f6ff0fbc61d45d2cc0f53ebbf11a5fb5/use_pass_instrument.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -216,10 +220,10 @@ profile the execution time of each passes.
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 7140us [7140us] (46.28%; 46.28%)
-    FoldScaleAxis: 8287us [7us] (53.72%; 53.72%)
-            FoldConstant: 8280us [1709us] (53.67%; 99.92%)
-                    InferType: 6571us [6571us] (42.60%; 79.36%)
+    InferType: 7092us [7092us] (46.21%; 46.21%)
+    FoldScaleAxis: 8256us [7us] (53.79%; 53.79%)
+            FoldConstant: 8250us [1701us] (53.75%; 99.92%)
+                    InferType: 6549us [6549us] (42.67%; 79.39%)
 
 
 
@@ -258,10 +262,10 @@ Refer to following sections and :py:func:`tvm.instrument.pass_instrument` for th
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 6629us [6629us] (45.04%; 45.04%)
-    FoldScaleAxis: 8088us [5us] (54.96%; 54.96%)
-            FoldConstant: 8083us [1697us] (54.92%; 99.94%)
-                    InferType: 6386us [6386us] (43.39%; 79.00%)
+    InferType: 6621us [6621us] (45.31%; 45.31%)
+    FoldScaleAxis: 7992us [5us] (54.69%; 54.69%)
+            FoldConstant: 7987us [1682us] (54.66%; 99.94%)
+                    InferType: 6305us [6305us] (43.15%; 78.94%)
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
index 3eab26308c..9fa8a5202c 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/optimize_operators/opt_conv_cuda.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_conv_cuda.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_conv_cuda.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -44,7 +48,7 @@ channel, batch.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 40-48
+.. GENERATED FROM PYTHON SOURCE LINES 41-49
 
 Preparation and Algorithm
 -------------------------
@@ -55,7 +59,7 @@ of size 3 x 3.  We use stride size 1 and padding size 1 for the
 convolution. The following code defines the convolution algorithm in TVM.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-90
+.. GENERATED FROM PYTHON SOURCE LINES 49-91
 
 .. code-block:: default
 
@@ -108,7 +112,7 @@ convolution. The following code defines the convolution algorithm in TVM.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 91-112
+.. GENERATED FROM PYTHON SOURCE LINES 92-113
 
 Memory Hierarchy
 ----------------
@@ -132,7 +136,7 @@ WL. BL is a local cache of output B, which is also stored in the thread local
 registers.
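
 The hierarchy above maps to cache stages on the schedule; a sketch, where
 ``s``, ``Apad``, ``W`` and ``B`` come from the algorithm definition:

 .. code-block:: python

     # Shared-memory tiles for the padded input and the kernel.
     AA = s.cache_read(Apad, "shared", [B])
     WW = s.cache_read(W, "shared", [B])
     # Per-thread register copies.
     AL = s.cache_read(AA, "local", [B])
     WL = s.cache_read(WW, "local", [B])
     # Local accumulation buffer for the output.
     BL = s.cache_write(B, "local")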
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-122
+.. GENERATED FROM PYTHON SOURCE LINES 113-123
 
 .. code-block:: default
 
@@ -153,7 +157,7 @@ registers.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 123-139
+.. GENERATED FROM PYTHON SOURCE LINES 124-140
 
 Blocking
 --------
@@ -172,7 +176,7 @@ shared memory.
      :width: 317px
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 139-167
+.. GENERATED FROM PYTHON SOURCE LINES 140-168
 
 .. code-block:: default
 
@@ -211,7 +215,7 @@ shared memory.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 168-181
+.. GENERATED FROM PYTHON SOURCE LINES 169-182
 
 Virtual Thread Split
 --------------------
 each thread computes 4 strided grids, where the size of each grid is 4 x 4.
      :width: 268px
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 181-193
+.. GENERATED FROM PYTHON SOURCE LINES 182-194
 
 .. code-block:: default
 
@@ -250,7 +254,7 @@ each thread computes 4 strided grids, where size of each grid is 4 x 4.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 194-202
+.. GENERATED FROM PYTHON SOURCE LINES 195-203
 
 Cooperative Fetching
 --------------------
@@ -261,7 +265,7 @@ transfer per thread, the following code lets threads in the same thread block
 cooperatively fetch dependent data from global memory.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 202-238
+.. GENERATED FROM PYTHON SOURCE LINES 203-239
 
 .. code-block:: default
 
@@ -308,7 +312,7 @@ coopertively fetch dependent data from global memory.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 239-245
+.. GENERATED FROM PYTHON SOURCE LINES 240-246
 
 Generate CUDA Kernel
 --------------------
@@ -317,7 +321,7 @@ Finally we use TVM to generate and compile the CUDA kernel, and evaluate the
 latency of convolution.
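
 A sketch of this step, with shapes matching the tutorial's HWCN layout
 (14x14 feature map, 256 input channels, 512 output channels, batch 256):

 .. code-block:: python

     import numpy as np
     import tvm

     func = tvm.build(s, [A, W, B], "cuda")
     dev = tvm.cuda(0)
     a = tvm.nd.array(np.random.uniform(size=(14, 14, 256, 256)).astype("float32"), dev)
     w = tvm.nd.array(np.random.uniform(size=(3, 3, 256, 512)).astype("float32"), dev)
     b = tvm.nd.array(np.zeros((14, 14, 512, 256), dtype="float32"), dev)
     evaluator = func.time_evaluator(func.entry_name, dev, number=1)
     print("Convolution: %f ms" % (evaluator(a, w, b).mean * 1e3))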
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 245-256
+.. GENERATED FROM PYTHON SOURCE LINES 246-257
 
 .. code-block:: default
 
@@ -340,7 +344,7 @@ latency of convolution.
 
  .. code-block:: none
 
-    Convolution: 54.212608 ms
+    Convolution: 35.913761 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
index 50528715fd..12f86b05f1 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/optimize_operators/opt_conv_tensorcore.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_conv_tensorcore.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_conv_tensorcore.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -158,7 +162,7 @@ NHWCnc memory layout.The following code defines the convolution algorithm in TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 152-158
+.. GENERATED FROM PYTHON SOURCE LINES 153-159
 
 Memory Scope
 ------------
 To support TensorCores, we add three more special memory scopes: :code:`wmma.matrix_a`,
 :code:`wmma.matrix_b`, and :code:`wmma.accumulator`. On hardware, all fragment scopes
 are stored at the on-chip register level, the same place as local memory.
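
 A sketch of attaching these scopes via cache stages, following the
 tutorial's naming (``AS``/``WS`` are the shared-memory stages, ``Conv`` the
 convolution output):

 .. code-block:: python

     # Fragment scopes live in on-chip registers, like "local" memory.
     AF = s.cache_read(AS, "wmma.matrix_a", [Conv])
     WF = s.cache_read(WS, "wmma.matrix_b", [Conv])
     ConvF = s.cache_write(Conv, "wmma.accumulator")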
 
-.. GENERATED FROM PYTHON SOURCE LINES 158-166
+.. GENERATED FROM PYTHON SOURCE LINES 159-167
 
 .. code-block:: default
 
@@ -186,7 +190,7 @@ stores at the on-chip registers level, the same place with local memory.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 167-176
+.. GENERATED FROM PYTHON SOURCE LINES 168-177
 
 Define Tensor Intrinsic
 -----------------------
@@ -198,7 +202,7 @@ There are four basic operation in TensorCore: :code:`fill_fragment`, :code:`load
 :code:`mma_sync` and :code:`store_matrix`. Since :code:`fill_fragment` and :code:`mma_sync`
 are both used in matrix multiplication, we can just write the following three intrinsics.
 
-.. GENERATED FROM PYTHON SOURCE LINES 176-297
+.. GENERATED FROM PYTHON SOURCE LINES 177-298
 
 .. code-block:: default
 
@@ -330,7 +334,7 @@ are both used in matrix multiplication, so we can just write following three int
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 298-319
+.. GENERATED FROM PYTHON SOURCE LINES 299-320
 
 Scheduling the Computation
 --------------------------
@@ -354,7 +358,7 @@ one time.
  TensorCore intrinsics directly or indirectly. Also note that this is not the only solution.
  The only thing we need to do is make sure all threads in a warp can call TensorCore at the same time.
 
-.. GENERATED FROM PYTHON SOURCE LINES 319-382
+.. GENERATED FROM PYTHON SOURCE LINES 320-383
 
 .. code-block:: default
 
@@ -528,14 +532,14 @@ one time.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 383-387
+.. GENERATED FROM PYTHON SOURCE LINES 384-388
 
 Lowering Computation to Intrinsics
 ----------------------------------
 The last phase is to lower the computation loops down to TensorCore hardware intrinsics
 by mapping the 2D convolution to tensor intrinsics
 
-.. GENERATED FROM PYTHON SOURCE LINES 387-394
+.. GENERATED FROM PYTHON SOURCE LINES 388-395
 
 .. code-block:: default
 
@@ -624,7 +628,7 @@ by mapping the 2D convolution to tensor intrinsics
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 395-400
+.. GENERATED FROM PYTHON SOURCE LINES 396-401
 
 Generate CUDA Kernel
 --------------------
@@ -632,7 +636,7 @@ Finally we use TVM to generate and compile the CUDA kernel, and evaluate the lat
 Since TensorCores are only supported on NVIDIA GPUs with Compute Capability 7.0 or higher, it may not
 be able to run on our build server.
 
-.. GENERATED FROM PYTHON SOURCE LINES 400-413
+.. GENERATED FROM PYTHON SOURCE LINES 401-414
 
 .. code-block:: default
 
@@ -657,12 +661,12 @@ be able to run on our build server
 
  .. code-block:: none
 
-    conv2d with tensor core: 13.367702 ms
+    conv2d with tensor core: 13.342461 ms
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 414-418
+.. GENERATED FROM PYTHON SOURCE LINES 415-419
 
 Summary
 -------
diff --git a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
index 94651182a1..efbfe428b6 100644
--- a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/optimize_operators/opt_gemm.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_gemm.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_gemm.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -143,8 +147,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 
  .. code-block:: none
 
-    Numpy running time: 0.018344
-    Baseline: 3.447542
+    Numpy running time: 0.018220
+    Baseline: 3.325094
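
For context, the baseline being timed is the naive TE matmul (M = N = K = 1024) with
the default schedule, i.e. three nested loops; a self-contained sketch:

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import te

    M = K = N = 1024
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

    s = te.create_schedule(C.op)  # default schedule: no transformations
    func = tvm.build(s, [A, B, C], target="llvm")
    a = tvm.nd.array(np.random.rand(M, K).astype("float32"))
    b = tvm.nd.array(np.random.rand(K, N).astype("float32"))
    c = tvm.nd.array(np.zeros((M, N), dtype="float32"))
    evaluator = func.time_evaluator(func.entry_name, tvm.cpu(0), number=1)
    print("Baseline: %f" % evaluator(a, b, c).mean)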
 
 
 
@@ -238,7 +242,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 
  .. code-block:: none
 
-    Opt1: 0.294974
+    Opt1: 0.296584
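
Opt1's speedup comes purely from loop blocking. Continuing from the baseline sketch
above, the tiling is:

.. code-block:: python

    bn = 32
    s = te.create_schedule(C.op)
    # Tile the output into 32x32 blocks so a block's working set
    # (32 * 32 * 4 bytes = 4 KB) fits comfortably in a 32 KB L1 data cache.
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)
    s[C].reorder(mo, no, ko, ki, mi, ni)
    func = tvm.build(s, [A, B, C], target="llvm")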
 
 
 
@@ -340,7 +344,7 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cach
 
  .. code-block:: none
 
-    Opt2: 0.329433
+    Opt2: 0.329013
 
 
 
@@ -435,7 +439,7 @@ the access pattern for A matrix is more cache friendly.
 
  .. code-block:: none
 
-    Opt3: 0.115964
+    Opt3: 0.113831
 
 
 
@@ -559,7 +563,7 @@ flattening.
 
  .. code-block:: none
 
-    Opt4: 0.109243
+    Opt4: 0.109471
 
 
 
@@ -680,7 +684,7 @@ write to C when all the block results are ready.
 
  .. code-block:: none
 
-    Opt5: 0.112300
+    Opt5: 0.110779
 
 
 
@@ -804,7 +808,7 @@ Furthermore, we can also utilize multi-core processors to do the thread-level pa
 
  .. code-block:: none
 
-    Opt6: 0.146850
+    Opt6: 0.146559
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
index 50900006db..49513e5246 100644
--- a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:34.828** total execution time for **how_to_optimize_operators** files:
+**00:34.501** total execution time for **how_to_optimize_operators** files:
 
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:32.242 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:31.855 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.519 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.541 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.067 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.105 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
index 02efde217f..4c7fce352c 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
@@ -5,18 +5,18 @@
 
 Computation times
 =================
-**08:58.722** total execution time for **how_to_tune_with_autoscheduler** files:
+**09:02.544** total execution time for **how_to_tune_with_autoscheduler** files:
 
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 05:34.333 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 05:41.286 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:31.718 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:30.702 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 01:01.530 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 01:00.923 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:27.940 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:26.801 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:12.065 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:11.840 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:11.136 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:10.991 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
index b9b2477ff2..7fd20a997e 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/5f1f7bd7d90710fd404f7bcdc4965622/tune_conv2d_layer_cuda.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -58,7 +62,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 54-59
+.. GENERATED FROM PYTHON SOURCE LINES 55-60
 
 Define the computation
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -66,7 +70,7 @@ To begin with, let us define the computation of a convolution layer.
 The function should return the list of input/output tensors.
 From these tensors, the auto-scheduler can get the whole computational graph.
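
Such a function is an ordinary TE builder decorated with
:code:`@auto_scheduler.register_workload`; condensed, it has this shape (mirroring
the code below):

.. code-block:: python

    import tvm
    from tvm import te, topi, auto_scheduler


    @auto_scheduler.register_workload
    def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
        data = te.placeholder((N, CI, H, W), name="data")
        kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
        bias = te.placeholder((1, CO, 1, 1), name="bias")
        conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
        out = topi.nn.relu(conv + bias)
        # Returning the tensors lets the auto-scheduler recover the whole graph.
        return [data, kernel, bias, out]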
 
-.. GENERATED FROM PYTHON SOURCE LINES 59-71
+.. GENERATED FROM PYTHON SOURCE LINES 60-72
 
 .. code-block:: default
 
@@ -89,13 +93,13 @@ From these tensors, the auto-scheduler can get the whole computational graph.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 72-75
+.. GENERATED FROM PYTHON SOURCE LINES 73-76
 
 Create the search task
 ^^^^^^^^^^^^^^^^^^^^^^
 We then create a search task for the last convolution layer in ResNet.
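
Creating the task is a single call; the layer parameters below (512 channels, 7x7
spatial) match the IR dumps shown later on this page:

.. code-block:: python

    target = tvm.target.Target("cuda")
    N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
    task = auto_scheduler.SearchTask(
        func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
    )
    print(task.compute_dag)  # inspect the recovered computational graph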
 
-.. GENERATED FROM PYTHON SOURCE LINES 75-88
+.. GENERATED FROM PYTHON SOURCE LINES 76-89
 
 .. code-block:: default
 
@@ -133,7 +137,7 @@ We then create a search task for the last convolution layer in the resnet.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 89-106
+.. GENERATED FROM PYTHON SOURCE LINES 90-107
 
 Next, we set parameters for the auto-scheduler. These parameters
 mainly specify how we do the measurement during the search.
@@ -153,7 +157,7 @@ mainly specify how we do the measurement during the search.
 * see :any:`auto_scheduler.TuningOptions`,
   :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
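
A typical configuration (10 trials only for demonstration; a real search wants on
the order of 1000):

.. code-block:: python

    log_file = "conv2d.json"
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=10,  # use ~1000 for a real search
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        verbose=2,
    )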
 
-.. GENERATED FROM PYTHON SOURCE LINES 106-116
+.. GENERATED FROM PYTHON SOURCE LINES 107-117
 
 .. code-block:: default
 
@@ -180,7 +184,7 @@ mainly specify how we do the measurement during the search.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-123
+.. GENERATED FROM PYTHON SOURCE LINES 118-124
 
 Run the search
 ^^^^^^^^^^^^^^
@@ -189,7 +193,7 @@ We can kick off the search and let the auto-scheduler do its magic.
 After some measurement trials, we can load the best schedule from the log
 file and apply it.
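
With :code:`task` and :code:`tune_option` as above, this amounts to:

.. code-block:: python

    task.tune(tune_option)                 # run the measurement trials
    sch, args = task.apply_best(log_file)  # load and apply the best schedule found
    del measure_ctx                        # terminate the measurement process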
 
-.. GENERATED FROM PYTHON SOURCE LINES 123-132
+.. GENERATED FROM PYTHON SOURCE LINES 124-133
 
 .. code-block:: default
 
@@ -209,13 +213,13 @@ file and apply it.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 133-136
+.. GENERATED FROM PYTHON SOURCE LINES 134-137
 
 We can lower the schedule to see the IR after auto-scheduling.
 The auto-scheduler correctly performs optimizations including multi-level tiling,
 cooperative fetching, unrolling and operator fusion.
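
With the :code:`sch` and :code:`args` returned by :code:`task.apply_best`:

.. code-block:: python

    print(tvm.lower(sch, args, simple_mode=True))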
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-140
+.. GENERATED FROM PYTHON SOURCE LINES 137-141
 
 .. code-block:: default
 
@@ -240,278 +244,162 @@ cooperative fetching, unrolling and operator fusion.
                  compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
       buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
       attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 64;
-      allocate(conv2d_nchw: Pointer(local float32), float32, [8]), storage_scope = local;
-      allocate(pad_temp.shared: Pointer(shared float32), float32, [1296]), storage_scope = shared;
-      allocate(kernel.shared: Pointer(shared float32), float32, [1152]), storage_scope = shared;
-      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [8], [], scope="local", align=32)[0] = 0f32
+      allocate(conv2d_nchw: Pointer(local float32), float32, [4]), storage_scope = local;
+      allocate(pad_temp.shared: Pointer(shared float32), float32, [4032]), storage_scope = shared;
+      allocate(kernel.shared: Pointer(shared float32), float32, [1536]), storage_scope = shared;
+      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98 {
+        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [1], [], scope="local", align=4)[0] = 0f32
         conv2d_nchw_1[1] = 0f32
         conv2d_nchw_1[2] = 0f32
         conv2d_nchw_1[3] = 0f32
-        conv2d_nchw_1[4] = 0f32
-        conv2d_nchw_1[5] = 0f32
-        conv2d_nchw_1[6] = 0f32
-        conv2d_nchw_1[7] = 0f32
-        for (rc.outer.outer: int32, 0, 32) {
-          let cse_var_2: int32 = (rc.outer.outer*784)
-          let cse_var_1: int32 = (rc.outer.outer*144)
-           {
-            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1296], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((9 <= threadIdx.x_1) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3: Buffer(data_2, float32, [25088], [])[(((cse_var_2 + (floordiv(threadIdx.x_1, 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 49)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 49), 81)) && (floormod((threadIdx.x_1 + 49), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 49), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 49), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 98)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 8), 9)) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 98), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 17), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 147)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 66), 81)) && (floormod((threadIdx.x_1 + 66), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 147), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 66), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 34), 81)) && (floormod((threadIdx.x_1 + 34), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 196), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 34), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 245)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 2), 81)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 245), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 2), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 294)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 51), 81)) && (floormod((threadIdx.x_1 + 51), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 294), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 51), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 343)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 343), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 19), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 68), 81)) && (floormod((threadIdx.x_1 + 68), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 392), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 68), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 441)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 4), 9)) && (floormod((threadIdx.x_1 + 36), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 441), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 4), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 490)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 4), 81)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 490), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 4), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 539)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 53), 81)) && (floormod((threadIdx.x_1 + 53), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 539), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 53), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 588), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 21), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 637)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 70), 81)) && (floormod((threadIdx.x_1 + 70), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 637), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 70), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 686)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 38), 81)) && (floormod((threadIdx.x_1 + 38), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 686), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 38), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 735)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 6), 81)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 735), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 6), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 55), 81)) && (floormod((threadIdx.x_1 + 55), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 784), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 55), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 833)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 833), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 23), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 882)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 8), 9)) && (floormod((threadIdx.x_1 + 72), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 882), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 8), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 931)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 40), 81)) && (floormod((threadIdx.x_1 + 40), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 931), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 40), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 8), 81)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 980), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 8), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1029)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 57), 81)) && (floormod((threadIdx.x_1 + 57), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1029), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 57), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1078)] = @tir.if_then_else((((threadIdx.x_1 < 47) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1078), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 25), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1127)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 74), 81)) && (floormod((threadIdx.x_1 + 74), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1127), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 74), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 42), 81)) && (floormod((threadIdx.x_1 + 42), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1176), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 42), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1225)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1225), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 10), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            if @tir.likely((threadIdx.x_1 < 22), dtype=bool) {
-              pad_temp.shared_1[(threadIdx.x_1 + 1274)] = @tir.if_then_else((((threadIdx.x_1 < 13) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1274), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 59), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              kernel.shared_1: Buffer(kernel.shared, float32, [1152], [], scope="shared")[(threadIdx.x_2*8)] = kernel_3: Buffer(kernel_2, float32, [2359296], [])[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floordiv((floormod(threadIdx.x_2, 18)*8), 3)*3)) + floormod((threadIdx.x_2*2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 1)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floordiv(((floormod(threadIdx.x_2, 18)*8) + 1), 3)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 2)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floordiv(((floormod(threadIdx.x_2, 18)*8) + 2), 3)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 3)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 1), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 4)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floordiv(((floormod(threadIdx.x_2, 18)*8) + 4), 3)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 5)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floordiv(((floormod(threadIdx.x_2, 18)*8) + 5), 3)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 6)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 2), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 7)] = kernel_3[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 18)*4608)) + cse_var_1) + (floordiv(((floormod(threadIdx.x_2, 18)*8) + 7), 3)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-            }
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              kernel.shared_1[((threadIdx.x_2*8) + 392)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 104), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 393)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 35), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 394)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 106), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 395)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floormod((floordiv(((threadIdx.x_2*8) + 392), 3) + 1), 48)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 396)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 36), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 397)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 109), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 398)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floormod((floordiv(((threadIdx.x_2*8) + 392), 3) + 2), 48)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              kernel.shared_1[((threadIdx.x_2*8) + 399)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 37), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-            }
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 784)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 64), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 785)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 65), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 786)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 22), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 787)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floormod((floordiv(((threadIdx.x_2*8) + 784), 3) + 1), 48)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
+        for (rc.outer.outer: int32, 0, 8) {
+          for (ry.outer.outer: int32, 0, 3) {
+            let cse_var_4: int32 = (rc.outer.outer*3136)
+            let cse_var_3: int32 = (ry.outer.outer*7)
+            let cse_var_2: int32 = (rc.outer.outer*576)
+            let cse_var_1: int32 = (ry.outer.outer*3)
+             {
+              attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1: Buffer(pad_temp.shared, float32, [4032], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else(((((1 <= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) && ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3: Buffer(data_2, float32, [25088], [])[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) - 8)], 0f32 [...]
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 98)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 98), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 196), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 294)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 294), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 392), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 490)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 490), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 588), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 686)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 686), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 784), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 882)] = @tir.if_then_else(((((1 <= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) && ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 678)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 980), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1078)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1078), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1176), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1274)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1274), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1372)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1372), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1470)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1470), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1568), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1666)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1666), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1764)] = @tir.if_then_else(((((1 <= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) && ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 1364)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1862)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1862), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 1960), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2058)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2058), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2156)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2156), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2254)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2254), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2352)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2352), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2450)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2450), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2548)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2548), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2646)] = @tir.if_then_else(((((1 <= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) && ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 2050)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2744)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2744), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2842)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2842), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 2940)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 2940), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3038)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3038), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3136)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3136), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3234)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 21), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3234), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3332)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3332), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3430)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3430), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3528)] = @tir.if_then_else(((((1 <= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) && ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data_3[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 2736)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3626)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3626), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3724)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3724), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3822)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3822), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              pad_temp.shared_1[(threadIdx.x_1 + 3920)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 14), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 3920), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              if @tir.likely((threadIdx.x_1 < 14), dtype=bool) {
+                pad_temp.shared_1[(threadIdx.x_1 + 4018)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data_3[((((cse_var_4 + (floordiv((threadIdx.x_1 + 4018), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
               }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 788)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 68), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
+              attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1: Buffer(kernel.shared, float32, [1536], [], scope="shared")[threadIdx.x_2] = kernel_3: Buffer(kernel_2, float32, [2359296], [])[(((((blockIdx.x*36864) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 98)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 98), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 196)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 196), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 4), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 294)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 294), 192)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 34), 64)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 392)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 392), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 490)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 490), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 106), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 588)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 588), 192)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 4)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 686)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 686), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 110), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 784)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 784), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 882)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 882), 192)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 38), 64)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 980)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 980), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 20), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 1078)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1078), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 118), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1176), 192)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 8)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 1274)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1274), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 122), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              kernel.shared_1[(threadIdx.x_2 + 1372)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1372), 192)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 192), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 98;
+              if @tir.likely((threadIdx.x_2 < 66), dtype=bool) {
+                kernel.shared_1[(threadIdx.x_2 + 1470)] = kernel_3[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1470), 192)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 42)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
               }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 789)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floormod((floordiv((threadIdx.x_2*8), 3) + 23), 48)*3)) + floormod((threadIdx.x_2*2), 3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 790)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floormod((floordiv(((threadIdx.x_2*8) + 784), 3) + 2), 48)*3)) + floormod(((threadIdx.x_2*2) + 1), 3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 46), dtype=bool) {
-                kernel.shared_1[((threadIdx.x_2*8) + 791)] = kernel_3[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 18)*4608)) + cse_var_1) + (floordiv(floormod(((threadIdx.x_2*8) + 71), 144), 3)*3)) + floormod(((threadIdx.x_2*2) + 2), 3))]
-              }
-            }
-            for (rc.outer.inner: int32, 0, 8) {
-              let cse_var_3: int32 = (rc.outer.inner*18)
-               {
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[cse_var_3]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 144)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 1)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 145)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 2)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 146)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 9)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 153)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 10)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 154)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 11)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 155)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 288)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 432)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 289)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 433)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 290)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 434)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 297)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 441)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 298)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 442)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 299)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 443)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 576)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 720)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 577)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 721)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 578)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 722)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 585)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 729)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 586)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 730)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 587)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 731)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 864)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_3 + 1008)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 865)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(cse_var_3 + 1009)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 866)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(cse_var_3 + 1010)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 873)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_3 + 1017)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 874)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(cse_var_3 + 1018)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 875)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(cse_var_3 + 1019)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 3)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 147)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 4)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 148)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 5)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 149)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 12)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 156)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 13)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 157)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 14)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 158)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 291)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 435)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 292)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 436)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 293)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 437)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 300)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 444)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 301)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 445)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 302)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 446)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 579)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 723)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 580)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 724)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 581)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 725)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 588)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 732)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 589)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 733)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 590)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 734)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 867)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_3 + 1011)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 868)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(cse_var_3 + 1012)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 869)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(cse_var_3 + 1013)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 876)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_3 + 1020)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 877)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(cse_var_3 + 1021)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 878)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(cse_var_3 + 1022)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 6)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 150)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 7)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 151)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 8)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 152)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 15)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 159)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 16)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 160)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 17)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 161)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 294)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 438)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 295)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 439)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 296)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 440)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 303)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 447)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 304)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 448)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 305)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 449)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 582)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 726)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 583)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 727)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 584)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 728)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 591)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 735)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 592)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 736)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 593)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 737)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 870)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_3 + 1014)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 871)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(cse_var_3 + 1015)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 872)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(cse_var_3 + 1016)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 879)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_3 + 1023)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 880)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(cse_var_3 + 1024)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 881)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(threadIdx.x, 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(cse_var_3 + 1025)]))
+              for (rc.inner: int32, 0, 64) {
+                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*192) + (rc.inner*3))]))
+                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 384)]))
+                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 768)]))
+                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 1152)]))
+                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 1)]))
+                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 385)]))
+                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 769)]))
+                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 1153)]))
+                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 2)]))
+                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 386)]))
+                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 770)]))
+                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.inner*63) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*192) + (rc.inner*3)) + 1154)]))
               }
             }
           }
         }
-        for (i1.inner: int32, 0, 8) {
-          compute_3: Buffer(compute_2, float32, [25088], [])[(((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x)] = max((conv2d_nchw_1[i1.inner] + bias_3: Buffer(bias_2, float32, [512], [])[((blockIdx.x*8) + i1.inner)]), 0f32)
-        }
+        compute_3: Buffer(compute_2, float32, [25088], [])[((blockIdx.x*392) + threadIdx.x)] = max((conv2d_nchw_1[0] + bias_3: Buffer(bias_2, float32, [512], [])[((blockIdx.x*8) + floordiv(threadIdx.x, 49))]), 0f32)
+        compute_3[(((blockIdx.x*392) + threadIdx.x) + 98)] = max((conv2d_nchw_1[1] + bias_3[(((blockIdx.x*8) + floordiv(threadIdx.x, 49)) + 2)]), 0f32)
+        compute_3[(((blockIdx.x*392) + threadIdx.x) + 196)] = max((conv2d_nchw_1[2] + bias_3[(((blockIdx.x*8) + floordiv(threadIdx.x, 49)) + 4)]), 0f32)
+        compute_3[(((blockIdx.x*392) + threadIdx.x) + 294)] = max((conv2d_nchw_1[3] + bias_3[(((blockIdx.x*8) + floordiv(threadIdx.x, 49)) + 6)]), 0f32)
       }
     }
 
@@ -520,13 +408,13 @@ cooperative fetching, unrolling and operator fusion.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 141-144
+.. GENERATED FROM PYTHON SOURCE LINES 142-145
 
 Check correctness and evaluate performance
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 We build the binary and check its correctness and performance.
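
In essence, this step is a compile-and-time loop. The sketch below is a
minimal, hedged version of it (not the tutorial's exact code): it assumes the
``task``, ``target``, and conv2d workload defined earlier (N=1, CI=CO=512,
7x7 feature map, 3x3 kernel, stride 1, padding 1) and the record file
"conv2d.json" written by the search; the tutorial's full code follows.

.. code-block:: python

    import numpy as np
    import tvm

    # Rebuild the best schedule found so far and compile it for the GPU.
    sch, args = task.apply_best("conv2d.json")
    func = tvm.build(sch, args, target)

    # Random inputs matching the workload: data, kernel, bias, output.
    dev = tvm.cuda()
    data_np = np.random.uniform(size=(1, 512, 7, 7)).astype(np.float32)
    weight_np = np.random.uniform(size=(512, 512, 3, 3)).astype(np.float32)
    bias_np = np.random.uniform(size=(1, 512, 1, 1)).astype(np.float32)
    data_tvm = tvm.nd.array(data_np, dev)
    weight_tvm = tvm.nd.array(weight_np, dev)
    bias_tvm = tvm.nd.array(bias_np, dev)
    out_tvm = tvm.nd.empty((1, 512, 7, 7), device=dev)
    func(data_tvm, weight_tvm, bias_tvm, out_tvm)

    # Check correctness against a NumPy reference (conv2d + bias + ReLU).
    from tvm.topi.testing import conv2d_nchw_python
    out_np = np.maximum(
        conv2d_nchw_python(data_np, weight_np, (1, 1), (1, 1)) + bias_np, 0.0
    )
    np.testing.assert_allclose(out_tvm.numpy(), out_np, rtol=1e-3)

    # Time the kernel; the median of repeated runs reduces noise.
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    costs = evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results
    print("Execution time of this operator: %.3f ms" % (np.median(costs) * 1000))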
 
-.. GENERATED FROM PYTHON SOURCE LINES 144-171
+.. GENERATED FROM PYTHON SOURCE LINES 145-172
 
 .. code-block:: default
 
@@ -565,12 +453,12 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.212 ms
+    Execution time of this operator: 0.280 ms
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 172-177
+.. GENERATED FROM PYTHON SOURCE LINES 173-178
 
 Using the record file
 ^^^^^^^^^^^^^^^^^^^^^
@@ -578,13 +466,13 @@ During the search, all measurement records are dumped into the record
 file "conv2d.json". The measurement records can be used to re-apply search results,
 resume the search, and perform other analyses.
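
For instance, re-applying a stored result is a one-liner
(``task.apply_best("conv2d.json")``, as above), while resuming the search
amounts to warming up the cost model and the search policy from the existing
records. A hedged sketch of the latter, reusing the ``task`` defined earlier
in this tutorial:

.. code-block:: python

    from tvm import auto_scheduler

    log_file = "conv2d.json"

    # Retrain the cost model from the existing measurement records.
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)

    # Seed the search policy with the states that were already measured,
    # so the tuner does not revisit them.
    search_policy = auto_scheduler.SketchPolicy(
        task,
        cost_model,
        init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
    )

    # Run a few more trials, appending new records to the same file.
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    task.tune(tune_option, search_policy=search_policy)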
 
-.. GENERATED FROM PYTHON SOURCE LINES 179-182
+.. GENERATED FROM PYTHON SOURCE LINES 180-183
 
 Here is an example where we load the best schedule from the record file
 and print the equivalent Python schedule API calls and the generated CUDA source code.
 Both are useful for debugging and for learning the behavior of the auto-scheduler.
 
-.. GENERATED FROM PYTHON SOURCE LINES 182-189
+.. GENERATED FROM PYTHON SOURCE LINES 183-190
 
 .. code-block:: default
 
@@ -613,10 +501,10 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_i, factor=1)
     conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
     conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
-    conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=2)
-    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=4)
-    conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=1)
-    conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
+    conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
+    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
+    conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=2)
+    conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=4)
     conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
     conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
     conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
@@ -625,19 +513,19 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
     conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
     conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
-    conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
-    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=8)
+    conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=64)
+    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=1)
     conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
-    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
+    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
     conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3)
     conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
     s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2 [...]
     compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
     compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
     compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=8)
-    compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=1)
-    compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
+    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
+    compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=2)
+    compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=4)
     compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
     compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
     compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
@@ -660,16 +548,16 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
     s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
     kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
-    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=8)
+    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
     s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49)
+    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=98)
     s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
     s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49)
+    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=98)
     s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 1024)
+    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 512)
     s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
 
     CUDA source code:
@@ -687,240 +575,100 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
       #define int64_t long long
       #define uint64_t unsigned long long
     #endif
-    extern "C" __global__ void __launch_bounds__(49) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-      float conv2d_nchw[8];
-      __shared__ float pad_temp_shared[1296];
-      __shared__ float kernel_shared[1152];
+    extern "C" __global__ void __launch_bounds__(98) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+      float conv2d_nchw[4];
+      __shared__ float pad_temp_shared[4032];
+      __shared__ float kernel_shared[1536];
       conv2d_nchw[0] = 0.000000e+00f;
       conv2d_nchw[1] = 0.000000e+00f;
       conv2d_nchw[2] = 0.000000e+00f;
       conv2d_nchw[3] = 0.000000e+00f;
-      conv2d_nchw[4] = 0.000000e+00f;
-      conv2d_nchw[5] = 0.000000e+00f;
-      conv2d_nchw[6] = 0.000000e+00f;
-      conv2d_nchw[7] = 0.000000e+00f;
-      for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
-        __syncthreads();
-        pad_temp_shared[((int)threadIdx.x)] = ((((9 <= ((int)threadIdx.x)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 49)] = (((((9 <= ((((int)threadIdx.x) + 49) % 81)) && (((((int)threadIdx.x) + 49) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 49) / 81) * 49)) + ((((((int)threadIdx.x) + 49) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 98)] = (((1 <= ((((int)threadIdx.x) + 8) % 9)) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 98) / 81) * 49)) + (((((int)threadIdx.x) + 17) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 147)] = (((((9 <= ((((int)threadIdx.x) + 66) % 81)) && (((((int)threadIdx.x) + 66) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 147) / 81) * 49)) + ((((((int)threadIdx.x) + 66) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 196)] = (((((9 <= ((((int)threadIdx.x) + 34) % 81)) && (((((int)threadIdx.x) + 34) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 196) / 81) * 49)) + ((((((int)threadIdx.x) + 34) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 245)] = ((((7 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 245) / 81) * 49)) + (((((int)threadIdx.x) + 2) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 294)] = (((((9 <= ((((int)threadIdx.x) + 51) % 81)) && (((((int)threadIdx.x) + 51) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 294) / 81) * 49)) + ((((((int)threadIdx.x) + 51) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 343)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 343) / 81) * 49)) + (((((int)threadIdx.x) + 19) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 392)] = (((((9 <= ((((int)threadIdx.x) + 68) % 81)) && (((((int)threadIdx.x) + 68) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 81) * 49)) + ((((((int)threadIdx.x) + 68) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 441)] = (((((1 <= (((((int)threadIdx.x) / 9) + 4) % 9)) && (((((int)threadIdx.x) + 36) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 441) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 4) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 490)] = ((((5 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 490) / 81) * 49)) + (((((int)threadIdx.x) + 4) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 539)] = (((((9 <= ((((int)threadIdx.x) + 53) % 81)) && (((((int)threadIdx.x) + 53) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 539) / 81) * 49)) + ((((((int)threadIdx.x) + 53) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 588)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 588) / 81) * 49)) + (((((int)threadIdx.x) + 21) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 637)] = (((((9 <= ((((int)threadIdx.x) + 70) % 81)) && (((((int)threadIdx.x) + 70) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 637) / 81) * 49)) + ((((((int)threadIdx.x) + 70) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 686)] = (((((9 <= ((((int)threadIdx.x) + 38) % 81)) && (((((int)threadIdx.x) + 38) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 686) / 81) * 49)) + ((((((int)threadIdx.x) + 38) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 735)] = ((((3 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 735) / 81) * 49)) + (((((int)threadIdx.x) + 6) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((9 <= ((((int)threadIdx.x) + 55) % 81)) && (((((int)threadIdx.x) + 55) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 81) * 49)) + ((((((int)threadIdx.x) + 55) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 833)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 833) / 81) * 49)) + (((((int)threadIdx.x) + 23) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 882)] = (((((1 <= (((((int)threadIdx.x) / 9) + 8) % 9)) && (((((int)threadIdx.x) + 72) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 882) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 8) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 931)] = (((((9 <= ((((int)threadIdx.x) + 40) % 81)) && (((((int)threadIdx.x) + 40) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 931) / 81) * 49)) + ((((((int)threadIdx.x) + 40) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 980)] = ((((1 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 980) / 81) * 49)) + (((((int)threadIdx.x) + 8) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1029)] = (((((9 <= ((((int)threadIdx.x) + 57) % 81)) && (((((int)threadIdx.x) + 57) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1029) / 81) * 49)) + ((((((int)threadIdx.x) + 57) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1078)] = ((((((int)threadIdx.x) < 47) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1078) / 81) * 49)) + (((((int)threadIdx.x) + 25) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1127)] = (((((9 <= ((((int)threadIdx.x) + 74) % 81)) && (((((int)threadIdx.x) + 74) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1127) / 81) * 49)) + ((((((int)threadIdx.x) + 74) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((((9 <= ((((int)threadIdx.x) + 42) % 81)) && (((((int)threadIdx.x) + 42) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1176) / 81) * 49)) + ((((((int)threadIdx.x) + 42) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1225)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1225) / 81) * 49)) + (((((int)threadIdx.x) + 10) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        if (((int)threadIdx.x) < 22) {
-          pad_temp_shared[(((int)threadIdx.x) + 1274)] = ((((((int)threadIdx.x) < 13) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 1274) / 81) * 49)) + (((((int)threadIdx.x) + 59) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        }
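-        // cooperatively stage this channel block's kernel weights into kernel_shared, eight values per thread per group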
-        kernel_shared[(((int)threadIdx.x) * 8)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) % 18) * 8) / 3) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 1)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) % 18) * 8) + 1) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 2)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) % 18) * 8) + 2) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 3)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 1) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 4)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) % 18) * 8) + 4) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 5)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) % 18) * 8) + 5) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 6)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 2) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 7)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) % 18) * 8) + 7) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 392)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 104) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 393)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 35) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 394)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 106) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 395)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + ((((((((int)threadIdx.x) * 8) + 392) / 3) + 1) % 48) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 396)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 36) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 397)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 109) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 398)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + ((((((((int)threadIdx.x) * 8) + 392) / 3) + 2) % 48) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        kernel_shared[((((int)threadIdx.x) * 8) + 399)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 37) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 784)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 64) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 785)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 65) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 786)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 22) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 787)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + ((((((((int)threadIdx.x) * 8) + 784) / 3) + 1) % 48) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 788)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 68) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 789)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) / 3) + 23) % 48) * 3)) + ((((int)threadIdx.x) * 2) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 790)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + ((((((((int)threadIdx.x) * 8) + 784) / 3) + 2) % 48) * 3)) + (((((int)threadIdx.x) * 2) + 1) % 3))];
-        }
-        if (((int)threadIdx.x) < 46) {
-          kernel_shared[((((int)threadIdx.x) * 8) + 791)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 18) * 4608)) + (rc_outer_outer * 144)) + (((((((int)threadIdx.x) * 8) + 71) % 144) / 3) * 3)) + (((((int)threadIdx.x) * 2) + 2) % 3))];
-        }
-        __syncthreads();
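-        // accumulate the eight per-thread outputs over this channel block using the staged input and weight tiles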
-        for (int rc_outer_inner = 0; rc_outer_inner < 8; ++rc_outer_inner) {
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[(rc_outer_inner * 18)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 144)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 1)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 145)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 2)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 146)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 9)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 153)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 10)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 154)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 11)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 155)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 288)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 432)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 289)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 433)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 290)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 434)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 297)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 441)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 298)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 442)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 299)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 443)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 576)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 720)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 577)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 721)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 578)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 722)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 585)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 729)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 586)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 730)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 587)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 731)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 864)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((rc_outer_inner * 18) + 1008)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 865)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((rc_outer_inner * 18) + 1009)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 866)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((rc_outer_inner * 18) + 1010)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 873)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((rc_outer_inner * 18) + 1017)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 874)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((rc_outer_inner * 18) + 1018)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 875)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((rc_outer_inner * 18) + 1019)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 3)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 147)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 4)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 148)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 5)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 149)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 12)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 156)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 13)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 157)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 14)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 158)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 291)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 435)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 292)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 436)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 293)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 437)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 300)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 444)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 301)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 445)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 302)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 446)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 579)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 723)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 580)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 724)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 581)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 725)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 588)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 732)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 589)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 733)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 590)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 734)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 867)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((rc_outer_inner * 18) + 1011)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 868)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((rc_outer_inner * 18) + 1012)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 869)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((rc_outer_inner * 18) + 1013)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 876)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((rc_outer_inner * 18) + 1020)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 877)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((rc_outer_inner * 18) + 1021)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 878)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((rc_outer_inner * 18) + 1022)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 6)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 150)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 7)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 151)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 8)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 152)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 15)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 159)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 16)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 160)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 17)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 161)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 294)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 438)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 295)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 439)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 296)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 440)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 303)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 447)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 304)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 448)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 305)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 449)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 582)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 726)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 583)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 727)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 584)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 728)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 591)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 735)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 592)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 736)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 593)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 737)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 870)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((rc_outer_inner * 18) + 1014)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 871)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((rc_outer_inner * 18) + 1015)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 872)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((rc_outer_inner * 18) + 1016)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 879)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((rc_outer_inner * 18) + 1023)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 880)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((rc_outer_inner * 18) + 1024)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 881)]));
-          conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((((rc_outer_inner * 162) + ((((int)threadIdx.x) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((rc_outer_inner * 18) + 1025)]));
+      for (int rc_outer_outer = 0; rc_outer_outer < 8; ++rc_outer_outer) {
+        for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
+          __syncthreads();
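+          // cooperatively load the zero-padded input tile for this (rc, ry) slice into shared memory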
+          pad_temp_shared[((int)threadIdx.x)] = (((((1 <= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) && ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 98)] = (((((1 <= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 98) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 196)] = (((((1 <= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 196) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 294)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 294) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 392)] = (((((1 <= ((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 392) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 490)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 490) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 588)] = (((((1 <= ((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 588) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 686)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 686) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((1 <= ((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 784) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 882)] = (((((1 <= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) && ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 678)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 980)] = (((((1 <= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 980) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1078)] = (((((1 <= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1078) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1176) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1274)] = (((((1 <= ((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1274) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1372)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1372) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1470)] = (((((1 <= ((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1470) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1568)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1568) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1666)] = (((((1 <= ((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1666) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1764)] = (((((1 <= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) && ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 1364)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1862)] = (((((1 <= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1862) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 1960)] = (((((1 <= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 1960) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2058)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2058) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2156)] = (((((1 <= ((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2156) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2254)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2254) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2352)] = (((((1 <= ((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2352) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2450)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2450) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2548)] = (((((1 <= ((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2548) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2646)] = (((((1 <= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) && ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 2050)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2744)] = (((((1 <= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2744) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2842)] = (((((1 <= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2842) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 2940)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 2940) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3038)] = (((((1 <= ((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3038) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3136)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3136) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3234)] = (((((1 <= ((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3234) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3332)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3332) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3430)] = (((((1 <= ((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 28) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3430) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3528)] = (((((1 <= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) && ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 2736)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3626)] = (((((1 <= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3626) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3724)] = (((((1 <= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3724) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3822)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3822) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 3920)] = (((((1 <= ((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 14) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 3920) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          if (((int)threadIdx.x) < 14) {
+            pad_temp_shared[(((int)threadIdx.x) + 4018)] = (((((((((int)threadIdx.x) + 49) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 3136) + (((((int)threadIdx.x) + 4018) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          }
+          kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 36864) + (rc_outer_outer * 576)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 98)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) + 98) % 192) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 196)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 196) / 192) * 4608)) + (rc_outer_outer * 576)) + (((((int)threadIdx.x) + 4) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 294)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 294) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) / 3) + 34) & 63) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 392) / 192) * 4608)) + (rc_outer_outer * 576)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 490)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 490) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) + 106) % 192) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 588)] = kernel[(((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 588) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36)];
+          kernel_shared[(((int)threadIdx.x) + 686)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 686) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) + 110) % 192) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 784)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 784) / 192) * 4608)) + (rc_outer_outer * 576)) + (((((int)threadIdx.x) + 16) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 882)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 882) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) / 3) + 38) & 63) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 980)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 980) / 192) * 4608)) + (rc_outer_outer * 576)) + (((((int)threadIdx.x) + 20) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 1078)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1078) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) + 118) % 192) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[(((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1176) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 72)];
+          kernel_shared[(((int)threadIdx.x) + 1274)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1274) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((((int)threadIdx.x) + 122) % 192) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 1372)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1372) / 192) * 4608)) + (rc_outer_outer * 576)) + (((((int)threadIdx.x) + 28) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          if (((int)threadIdx.x) < 66) {
+            kernel_shared[(((int)threadIdx.x) + 1470)] = kernel[(((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1470) / 192) * 4608)) + (rc_outer_outer * 576)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 378)];
+          }
+          __syncthreads();
+          for (int rc_inner = 0; rc_inner < 64; ++rc_inner) {
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3))]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 384)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 768)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 1152)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 1)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 385)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 769)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 1153)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 2)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 386)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 770)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_inner * 63) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 192) + (rc_inner * 3)) + 1154)]));
+          }
         }
       }
-      for (int i1_inner = 0; i1_inner < 8; ++i1_inner) {
-        compute[(((((int)blockIdx.x) * 392) + (i1_inner * 49)) + ((int)threadIdx.x))] = max((conv2d_nchw[i1_inner] + bias[((((int)blockIdx.x) * 8) + i1_inner)]), 0.000000e+00f);
-      }
+      compute[((((int)blockIdx.x) * 392) + ((int)threadIdx.x))] = max((conv2d_nchw[0] + bias[((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 49))]), 0.000000e+00f);
+      compute[(((((int)blockIdx.x) * 392) + ((int)threadIdx.x)) + 98)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 49)) + 2)]), 0.000000e+00f);
+      compute[(((((int)blockIdx.x) * 392) + ((int)threadIdx.x)) + 196)] = max((conv2d_nchw[2] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 49)) + 4)]), 0.000000e+00f);
+      compute[(((((int)blockIdx.x) * 392) + ((int)threadIdx.x)) + 294)] = max((conv2d_nchw[3] + bias[(((((int)blockIdx.x) * 8) + (((int)threadIdx.x) / 49)) + 6)]), 0.000000e+00f);
     }
 
 
@@ -928,14 +676,14 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 190-194
+.. GENERATED FROM PYTHON SOURCE LINES 191-195
 
 A more complicated example is to resume the search.
 In this case, we need to create the search policy and cost model ourselves,
 and resume the status of the search policy and cost model from the log file.
 In the example below we resume the status and run 5 more trials.
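
A minimal sketch of what this resume step looks like, assuming a ``task`` created
earlier with ``auto_scheduler.SearchTask`` and an existing ``log_file`` (the names
follow the tutorial's conventions but are assumptions, not a verbatim copy of the
elided code):

.. code-block:: python

    from tvm import auto_scheduler

    def resume_search(task, log_file):
        # Rebuild the cost model and warm it up from the measurement log.
        cost_model = auto_scheduler.XGBModel()
        cost_model.update_from_file(log_file)
        # Recreate the search policy and preload the measured states,
        # so the search continues instead of starting from scratch.
        search_policy = auto_scheduler.SketchPolicy(
            task,
            program_cost_model=cost_model,
            init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
        )
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=5,  # the 5 extra trials mentioned above
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=search_policy)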
 
-.. GENERATED FROM PYTHON SOURCE LINES 194-216
+.. GENERATED FROM PYTHON SOURCE LINES 195-217
 
 .. code-block:: default
 
@@ -981,7 +729,7 @@ In the example below we resume the status and do more 5 trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 5 minutes  34.333 seconds)
+   **Total running time of the script:** ( 5 minutes  41.286 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt
index bf3bacb2b8..195929e962 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autoscheduler/tune_network_arm.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_arm.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_arm.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/4dc30a43f3a6aa3ed4bc3077ad35ff70/tune_network_arm.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
index 7606ae7cda..fe7b343ed3 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autoscheduler/tune_network_cuda.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_cuda.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_cuda.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/af264436d049e3cd84803b67b6620b63/tune_network_cuda.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -643,7 +647,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-       7.9021       7.8978       7.9128       7.8957       0.0076   
+       7.8845       7.8834       7.8916       7.8784       0.0055   
                
 
 
@@ -671,7 +675,7 @@ Other Tips
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  1.530 seconds)
+   **Total running time of the script:** ( 1 minutes  0.923 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_cuda.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt
index a4d0f834be..dc6bfdbca8 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autoscheduler/tune_network_mali.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_mali.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_mali.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/5e4e499c097b16a90c517e630502253a/tune_network_mali.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
index d459728df7..d1aac17678 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autoscheduler/tune_network_x86.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/ad2a7f55d615d188ad664d56696815a6/tune_network_x86.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -662,7 +666,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      746.6792     746.7520     746.7842     746.5015      0.1264   
+      744.5041     744.4052     745.8627     743.2444      1.0712   
                
 
 
@@ -690,7 +694,7 @@ Other Tips
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  31.718 seconds)
+   **Total running time of the script:** ( 1 minutes  30.702 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
index 109fbea549..dea7ae50a4 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autoscheduler/tune_sparse_x86.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_sparse_x86.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_sparse_x86.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/293f8d0753933b706a0b588f909fe38a/tune_sparse_x86.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -386,28 +390,24 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
                  placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [128, 512], []),
                  compute: Buffer(compute_2: Pointer(float32), float32, [128, 512], [])}
       buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute} {
-      for (i0.outer.i1.outer.fused: int32, 0, 512) "parallel" {
-        allocate(compute_3: Pointer(global float32), float32, [128]), storage_scope = global {
+      for (i0.outer.i1.outer.fused: int32, 0, 2048) "parallel" {
+        allocate(compute_3: Pointer(global float32), float32, [32]), storage_scope = global {
           for (i.outer.inner: int32, 0, 2) {
-            for (i.inner.init: int32, 0, 4) {
-              for (j.init: int32, 0, 16) {
-                compute_4: Buffer(compute_3, float32, [128], [])[(((i.outer.inner*64) + (i.inner.init*16)) + j.init)] = 0f32
-              }
+            for (j.init: int32, 0, 16) {
+              compute_4: Buffer(compute_3, float32, [32], [])[((i.outer.inner*16) + j.init)] = 0f32
             }
             for (elem_idx: int32, 0, let cse_var_1: int32 = floormod(i0.outer.i1.outer.fused, 32) in (placeholder_15: Buffer(placeholder_13, int32, [33], [])[(cse_var_1 + 1)] - placeholder_15[cse_var_1])) {
-              for (i.inner: int32, 0, 4) {
-                for (j: int32, 0, 16) {
-                  let cse_var_2: int32 = floormod(i0.outer.i1.outer.fused, 32)
-                  if @tir.likely((elem_idx < (placeholder_15[(cse_var_2 + 1)] - placeholder_15[cse_var_2])), dtype=bool) {
-                    let cse_var_3: int32 = (((i.outer.inner*64) + (i.inner*16)) + j)
-                    compute_4[cse_var_3] = (compute_4[cse_var_3] + (placeholder_16: Buffer(placeholder_11, float32, [78656], [])[(((placeholder_15[cse_var_2]*16) + (elem_idx*16)) + j)]*max(placeholder_17: Buffer(placeholder_10, float32, [32768], [])[((((floordiv(i0.outer.i1.outer.fused, 32)*2048) + (i.outer.inner*1024)) + (i.inner*256)) + placeholder_18: Buffer(placeholder_12, int32, [4916], [])[(placeholder_15[cse_var_2] + elem_idx)])], 0f32)))
-                  }
+              for (j: int32, 0, 16) {
+                let cse_var_2: int32 = floormod(i0.outer.i1.outer.fused, 32)
+                if @tir.likely((elem_idx < (placeholder_15[(cse_var_2 + 1)] - placeholder_15[cse_var_2])), dtype=bool) {
+                  let cse_var_3: int32 = ((i.outer.inner*16) + j)
+                  compute_4[cse_var_3] = (compute_4[cse_var_3] + (placeholder_16: Buffer(placeholder_11, float32, [78656], [])[(((placeholder_15[cse_var_2]*16) + (elem_idx*16)) + j)]*max(placeholder_17: Buffer(placeholder_10, float32, [32768], [])[(((floordiv(i0.outer.i1.outer.fused, 32)*512) + (i.outer.inner*256)) + placeholder_18: Buffer(placeholder_12, int32, [4916], [])[(placeholder_15[cse_var_2] + elem_idx)])], 0f32)))
                 }
               }
             }
           }
-          for (i0.inner: int32, 0, 8) {
-            let cse_var_4: int32 = (((floordiv(i0.outer.i1.outer.fused, 32)*4096) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 32)*16))
+          for (i0.inner: int32, 0, 2) {
+            let cse_var_4: int32 = (((floordiv(i0.outer.i1.outer.fused, 32)*1024) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 32)*16))
             compute_5: Buffer(compute_2, float32, [65536], [])[ramp(cse_var_4, 1, 16)] = max((compute_4[ramp((i0.inner*16), 1, 16)] + placeholder_19: Buffer(placeholder_14, float32, [65536], [])[ramp(cse_var_4, 1, 16)]), broadcast(0f32, 16))
           }
         }
@@ -464,7 +464,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 1.342 ms
+    Execution time of this operator: 1.901 ms
 
 
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
index 520041404a..bc2bff70aa 100644
--- a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
@@ -5,16 +5,16 @@
 
 Computation times
 =================
-**00:35.303** total execution time for **how_to_tune_with_autotvm** files:
+**00:38.048** total execution time for **how_to_tune_with_autotvm** files:
 
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:35.266 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:38.012 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.022 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.021 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)             | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``) | 00:00.005 | 0.0 MB |
-+--------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)               | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``) | 00:00.005 | 0.0 MB |
++--------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
index d3d28ac0a9..d785a067c3 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autotvm/tune_conv2d_cuda.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_conv2d_cuda.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_conv2d_cuda.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -75,7 +79,7 @@ Now return to python code. Import packages.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 69-91
+.. GENERATED FROM PYTHON SOURCE LINES 70-92
 
 Step 1:  Define the search space
 --------------------------------
@@ -100,7 +104,7 @@ It is worth noting that the search space for a conv2d operator
 can be very large (on the order of 10^9 for some input shapes).
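
The size comes from multiplying the candidate counts of every knob. As a hedged
illustration (a simple matmul template rather than the tutorial's conv2d one; the
template name and all identifiers below are assumptions), each ``define_split``
call adds another factor to the space:

.. code-block:: python

    import tvm
    from tvm import te, autotvm

    @autotvm.template("sketch/tiled_matmul")  # hypothetical template name
    def tiled_matmul(N, L, M, dtype):
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
        s = te.create_schedule(C.op)
        y, x = s[C].op.axis
        cfg = autotvm.get_config()
        # Every knob multiplies the number of candidate configs; a conv2d
        # template with tile_f/tile_y/tile_x/tile_rc/... reaches ~10^9.
        cfg.define_split("tile_y", y, num_outputs=2)
        cfg.define_split("tile_x", x, num_outputs=2)
        yo, yi = cfg["tile_y"].apply(s, C, y)
        xo, xi = cfg["tile_x"].apply(s, C, x)
        s[C].reorder(yo, xo, yi, xi)
        return s, [A, B, C]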
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 91-181
+.. GENERATED FROM PYTHON SOURCE LINES 92-182
 
 .. code-block:: default
 
@@ -201,7 +205,7 @@ can be very large (at the level of 10^9 for some input shapes)
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 182-189
+.. GENERATED FROM PYTHON SOURCE LINES 183-190
 
 Step 2:  Search through the space
 ---------------------------------
@@ -211,7 +215,7 @@ for our case. Here we only do 20 trials for demonstration.
 In practice, running about 1000 trials can usually find some good kernels
 for this template.
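
As a hedged sketch of this search step (``task`` is assumed to be the autotvm task
created earlier, and ``conv2d.log`` is an assumed log file name):

.. code-block:: python

    from tvm import autotvm

    # How each candidate config is compiled and timed.
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=100, timeout=4),
    )
    # Model-based tuner; 20 trials for demonstration, ~1000 in practice.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=20,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("conv2d.log")],
    )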
 
-.. GENERATED FROM PYTHON SOURCE LINES 189-218
+.. GENERATED FROM PYTHON SOURCE LINES 190-219
 
 .. code-block:: default
 
@@ -387,9 +391,624 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 1, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 128, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6255265
-    No: 2   GFLOPS: 6.36/6.36       result: MeasureResult(costs=(0.036371310999999996,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.9105820655822754, timestamp=1673056546.0760632)       [('tile_f', [-1, 1, 1, 2]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 32, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,3182795
-    No: 3   GFLOPS: 0.00/6.36       result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 16, 4, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6150439
+    No: 2   GFLOPS: 18.74/18.74     result: MeasureResult(costs=(0.012354673555555556,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.9697260856628418, timestamp=1673058803.3903506)       [('tile_f', [-1, 4, 1, 32]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5303507
+    No: 3   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
+        func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
+        func = build(s, args, target_host=task.target_host, runtime=runtime)
+      File "/workspace/python/tvm/driver/build_module.py", line 227, in build
+        input_mod = lower(inputs, args, name=name, binds=binds)
+      File "/workspace/python/tvm/driver/build_module.py", line 134, in lower
+        return ffi.lower_schedule(inp, args, name, binds, simple_mode)
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 276, in tvm._ffi._cy3.core.FuncCall
+      File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
+    tvm._ffi.base.TVMError: Traceback (most recent call last):
+      24: TVMFuncCall
+            at ../src/runtime/c_runtime_api.cc:477
+      23: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
+            at ../include/tvm/runtime/packed_func.h:1217
+      22: Call
+            at ../include/tvm/runtime/packed_func.h:1213
+      21: operator()
+            at ../include/tvm/runtime/packed_func.h:1730
+      20: unpack_call<tvm::IRModule, 5, tvm::<lambda(tvm::te::Schedule, const tvm::runtime::Array<tvm::runtime::ObjectRef>&, const tvm::runtime::String&, const tvm::runtime::Map<tvm::te::Tensor, tvm::tir::Buffer>&, bool)> >
+            at ../include/tvm/runtime/packed_func.h:1670
+      19: run<>
+            at ../include/tvm/runtime/packed_func.h:1630
+      18: run<tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      17: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      16: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      15: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      14: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1645
+      13: operator()
+            at ../src/driver/driver_api.cc:395
+      12: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, tvm::GlobalVarSupply, bool)
+            at ../src/driver/driver_api.cc:381
+      11: tvm::LowerWithPassList(tvm::IRModule, tvm::runtime::Array<tvm::transform::Pass, void>)
+            at ../src/driver/driver_api.cc:276
+      10: tvm::transform::Pass::operator()(tvm::IRModule) const
+            at ../src/ir/transform.cc:258
+      9: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/ir/transform.cc:274
+      8: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/ir/transform.cc:454
+      7: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/ir/transform.cc:274
+      6: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/tir/ir/transform.cc:100
+      5: tvm::runtime::TypedPackedFunc<tvm::tir::PrimFunc (tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)>::operator()(tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext) const
+            at ../include/tvm/runtime/packed_func.h:1749
+      4: tvm::tir::PrimFunc tvm::runtime::detail::typed_packed_call_dispatcher<tvm::tir::PrimFunc>::run<tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext>(tvm::runtime::PackedFunc const&, tvm::tir::PrimFunc&&, tvm::IRModule&&, tvm::transform::PassContext&&)
+            at ../include/tvm/runtime/packed_func.h:1693
+      3: tvm::runtime::TVMRetValue tvm::runtime::PackedFunc::operator()<tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext>(tvm::tir::PrimFunc&&, tvm::IRModule&&, tvm::transform::PassContext&&) const
+            at ../include/tvm/runtime/packed_func.h:1617
+      2: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
+            at ../include/tvm/runtime/packed_func.h:1217
+      1: Call
+            at ../include/tvm/runtime/packed_func.h:1213
+      0: operator()
+            at ../src/runtime/c_runtime_api.cc:534
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
+        raise InstantiationError("Skipped because of invalid gpu kernel")
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 4, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10012157
+    No: 4   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
+        func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
+        func = build(s, args, target_host=task.target_host, runtime=runtime)
+      File "/workspace/python/tvm/driver/build_module.py", line 227, in build
+        input_mod = lower(inputs, args, name=name, binds=binds)
+      File "/workspace/python/tvm/driver/build_module.py", line 134, in lower
+        return ffi.lower_schedule(inp, args, name, binds, simple_mode)
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 276, in tvm._ffi._cy3.core.FuncCall
+      File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
+    tvm._ffi.base.TVMError: Traceback (most recent call last):
+      24: TVMFuncCall
+            at ../src/runtime/c_runtime_api.cc:477
+      23: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
+            at ../include/tvm/runtime/packed_func.h:1217
+      22: Call
+            at ../include/tvm/runtime/packed_func.h:1213
+      21: operator()
+            at ../include/tvm/runtime/packed_func.h:1730
+      20: unpack_call<tvm::IRModule, 5, tvm::<lambda(tvm::te::Schedule, const tvm::runtime::Array<tvm::runtime::ObjectRef>&, const tvm::runtime::String&, const tvm::runtime::Map<tvm::te::Tensor, tvm::tir::Buffer>&, bool)> >
+            at ../include/tvm/runtime/packed_func.h:1670
+      19: run<>
+            at ../include/tvm/runtime/packed_func.h:1630
+      18: run<tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      17: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      16: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      15: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1630
+      14: run<tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_, tvm::runtime::TVMMovableArgValueWithContext_>
+            at ../include/tvm/runtime/packed_func.h:1645
+      13: operator()
+            at ../src/driver/driver_api.cc:395
+      12: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, tvm::GlobalVarSupply, bool)
+            at ../src/driver/driver_api.cc:381
+      11: tvm::LowerWithPassList(tvm::IRModule, tvm::runtime::Array<tvm::transform::Pass, void>)
+            at ../src/driver/driver_api.cc:276
+      10: tvm::transform::Pass::operator()(tvm::IRModule) const
+            at ../src/ir/transform.cc:258
+      9: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/ir/transform.cc:274
+      8: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/ir/transform.cc:454
+      7: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/ir/transform.cc:274
+      6: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
+            at ../src/tir/ir/transform.cc:100
+      5: tvm::runtime::TypedPackedFunc<tvm::tir::PrimFunc (tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)>::operator()(tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext) const
+            at ../include/tvm/runtime/packed_func.h:1749
+      4: tvm::tir::PrimFunc tvm::runtime::detail::typed_packed_call_dispatcher<tvm::tir::PrimFunc>::run<tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext>(tvm::runtime::PackedFunc const&, tvm::tir::PrimFunc&&, tvm::IRModule&&, tvm::transform::PassContext&&)
+            at ../include/tvm/runtime/packed_func.h:1693
+      3: tvm::runtime::TVMRetValue tvm::runtime::PackedFunc::operator()<tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext>(tvm::tir::PrimFunc&&, tvm::IRModule&&, tvm::transform::PassContext&&) const
+            at ../include/tvm/runtime/packed_func.h:1617
+      2: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
+            at ../include/tvm/runtime/packed_func.h:1217
+      1: Call
+            at ../include/tvm/runtime/packed_func.h:1213
+      0: operator()
+            at ../src/runtime/c_runtime_api.cc:534
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
+        raise InstantiationError("Skipped because of invalid gpu kernel")
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 128]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 8]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6100591
+    No: 5   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
+        func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
+        func = build(s, args, target_host=task.target_host, runtime=runtime)
+      File "/workspace/python/tvm/driver/build_module.py", line 227, in build
+        input_mod = lower(inputs, args, name=name, binds=binds)
+      File "/workspace/python/tvm/driver/build_module.py", line 134, in lower
+        return ffi.lower_schedule(inp, args, name, binds, simple_mode)
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 276, in tvm._ffi._cy3.core.FuncCall
+      File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
+    tvm._ffi.base.TVMError: Traceback (most recent call last):
+      24: TVMFuncCall
+            at ../src/runtime/c_runtime_api.cc:477
+      23: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
+            at ../include/tvm/runtime/packed_func.h:1217
+      22: Call
+            at ../include/tvm/runtime/packed_func.h:1213
+      21: operator()
+            at ../include/tvm/runtime/packed_func.h:1730
+      20: unpack_call<tvm::IRModule, 5, tvm::<lambda(tvm::te::Schedule, const tvm::runtime::Array<tvm::runtime::ObjectRef>&, const tvm::runtime::String&, const tvm::runtime::Map<tvm::te::Tensor, tvm::tir::Buffer>&, bool)> >
+            at ../include/tvm/runtime/packed_func.h:1670
+      19: run<>
+      [... remaining frames of the lowering traceback elided; the identical traceback is printed twice and ends in verify_pass raising InstantiationError ...]
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3826906
+    No: 6   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
+        func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
+        func = build(s, args, target_host=task.target_host, runtime=runtime)
+      File "/workspace/python/tvm/driver/build_module.py", line 227, in build
+        input_mod = lower(inputs, args, name=name, binds=binds)
+      File "/workspace/python/tvm/driver/build_module.py", line 134, in lower
+        return ffi.lower_schedule(inp, args, name, binds, simple_mode)
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 276, in tvm._ffi._cy3.core.FuncCall
+      File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
+    tvm._ffi.base.TVMError: Traceback (most recent call last):
+      [... C++ lowering traceback elided; the identical traceback is printed twice and ends in verify_pass raising InstantiationError ...]
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 1, 512]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 16]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9029239
+    No: 7   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
+        func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
+        func = build(s, args, target_host=task.target_host, runtime=runtime)
+      File "/workspace/python/tvm/driver/build_module.py", line 227, in build
+        input_mod = lower(inputs, args, name=name, binds=binds)
+      File "/workspace/python/tvm/driver/build_module.py", line 134, in lower
+        return ffi.lower_schedule(inp, args, name, binds, simple_mode)
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
+      File "tvm/_ffi/_cython/./packed_func.pxi", line 276, in tvm._ffi._cy3.core.FuncCall
+      File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
+    tvm._ffi.base.TVMError: Traceback (most recent call last):
+      [... C++ lowering traceback elided; the identical traceback is printed twice and ends in verify_pass raising InstantiationError ...]
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 1, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,3183625
+    No: 8   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -511,8 +1130,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 4, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 256]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9285838
-    No: 4   GFLOPS: 0.00/6.36       result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 8, 64]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7371529
+    No: 9   GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -634,8 +1253,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9778747
-    No: 5   GFLOPS: 0.00/6.36       result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 32, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 16]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6714222
+    No: 10  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -757,8 +1376,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 2, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 2]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10132332
-    No: 6   GFLOPS: 0.00/6.36       result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 2, 16]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9279551
+    No: 11  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -880,8 +1499,26 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 2]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6851081
-    No: 7   GFLOPS: 0.00/6.36       result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 64, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5284901
+    No: 12  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
+      File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
+        res = future.result()
+      File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
+        return self.__get_result()
+      File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
+        raise self._exception
+      File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
+        result = self.fn(*self.args, **self.kwargs)
+      File "/workspace/python/tvm/contrib/popen_pool.py", line 432, in <lambda>
+        worker = lambda *args: self._worker_run(*args)
+      File "/workspace/python/tvm/contrib/popen_pool.py", line 401, in _worker_run
+        return proc.recv()
+      File "/workspace/python/tvm/contrib/popen_pool.py", line 309, in recv
+        raise TimeoutError()
+    TimeoutError
+
+            [('tile_f', [-1, 16, 1, 1]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 4]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,8797804
+    No: 13  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1003,10 +1640,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 128, 2]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,8593957
-    No: 8   GFLOPS: 6.86/6.86       result: MeasureResult(costs=(0.03372837425,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8615002632141113, timestamp=1673056550.087199)       [('tile_f', [-1, 32, 16, 1]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5325359
-    No: 9   GFLOPS: 4.86/6.86       result: MeasureResult(costs=(0.047594440249999995,), error_no=MeasureErrorNo.NO_ERROR, all_cost=3.688849925994873, timestamp=1673056553.9651906)        [('tile_f', [-1, 8, 1, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 2, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,1782223
-    No: 10  GFLOPS: 0.00/6.86       result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 8, 8]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4597274
+    No: 14  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1128,9 +1763,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 4, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7092878
-    No: 11  GFLOPS: 260.74/260.74   result: MeasureResult(costs=(0.0008878736495726496,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.052375316619873, timestamp=1673056554.7084162)       [('tile_f', [-1, 2, 2, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 8]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7262309
-    No: 12  GFLOPS: 0.00/260.74     result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 128, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1537190
+    No: 15  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1252,9 +1886,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 32, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,8052185
-    No: 13  GFLOPS: 49.29/260.74    result: MeasureResult(costs=(0.004696770666666667,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.223013162612915, timestamp=1673056557.2106256)        [('tile_f', [-1, 16, 1, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2226899
-    No: 14  GFLOPS: 0.00/260.74     result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,5138821
+    No: 16  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1376,9 +2009,9 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 4, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 64, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9317337
-    No: 15  GFLOPS: 0.96/260.74     result: MeasureResult(costs=(0.2407640685,), error_no=MeasureErrorNo.NO_ERROR, all_cost=4.90652060508728, timestamp=1673056560.7619154) [('tile_f', [-1, 8, 1, 64]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,1756683
-    No: 16  GFLOPS: 0.00/260.74     result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 2, 8]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 8, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2282863
+    No: 17  GFLOPS: 8.08/18.74      result: MeasureResult(costs=(0.02864277625,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.394887685775757, timestamp=1673058819.6129212)       [('tile_f', [-1, 2, 4, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2430456
+    No: 18  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1500,8 +2133,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 8, 1]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 128, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9000232
-    No: 17  GFLOPS: 0.00/260.74     result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 8, 8, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 16]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6125490
+    No: 19  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1623,9 +2256,8 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 4, 64]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 2, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7504627
-    No: 18  GFLOPS: 39.89/260.74    result: MeasureResult(costs=(0.0058029208333333325,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8205113410949707, timestamp=1673056562.7755) [('tile_f', [-1, 8, 16, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4683049
-    No: 19  GFLOPS: 0.00/260.74     result: Traceback (most recent call last):
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 16, 1]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 512, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4097759
+    No: 20  GFLOPS: 0.00/18.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 592, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 544, in _build_func_common
@@ -1747,18 +2379,17 @@ for this template
       File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 875, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
-    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 256, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3876939
-    No: 20  GFLOPS: 293.48/293.48   result: MeasureResult(costs=(0.0007888221417322836,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.0552170276641846, timestamp=1673056563.5099218)      [('tile_f', [-1, 2, 16, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 2]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4296686
+    tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 256]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7404097
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 219-221
+.. GENERATED FROM PYTHON SOURCE LINES 220-222
 
 Finally we can inspect the best config from log file, check correctness,
 and measure running time.
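 
 A minimal sketch of that step, assuming the log file :code:`conv2d.log` and
 the template function :code:`conv2d_no_batching` defined earlier in this
 tutorial (the generated code below is the authoritative version):
 
 .. code-block:: default
 
     import numpy as np
     import tvm
     from tvm import autotvm
 
     # Apply the best configuration found during tuning and build the kernel.
     with autotvm.apply_history_best("conv2d.log"):
         with tvm.target.Target("cuda"):
             s, arg_bufs = conv2d_no_batching(1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1))
             func = tvm.build(s, arg_bufs)
 
     # Run once on random data, then time the operator.
     a_np = np.random.uniform(size=(1, 512, 7, 7)).astype(np.float32)
     w_np = np.random.uniform(size=(512, 512, 3, 3)).astype(np.float32)
     dev = tvm.cuda()
     a_tvm = tvm.nd.array(a_np, device=dev)
     w_tvm = tvm.nd.array(w_np, device=dev)
     c_tvm = tvm.nd.empty((1, 512, 7, 7), device=dev)
     func(a_tvm, w_tvm, c_tvm)
     evaluator = func.time_evaluator(func.entry_name, dev, number=10)
     print("Time cost of this operator: %f" % evaluator(a_tvm, w_tvm, c_tvm).mean)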
 
-.. GENERATED FROM PYTHON SOURCE LINES 221-251
+.. GENERATED FROM PYTHON SOURCE LINES 222-252
 
 .. code-block:: default
 
@@ -1803,9 +2434,9 @@ and measure running time.
     Finish loading 20 records
 
     Best config:
-    [('tile_f', [-1, 2, 16, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 2]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4296686
+    [('tile_f', [-1, 4, 1, 32]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5303507
     Finish loading 20 records
-    Time cost of this operator: 0.001189
+    Time cost of this operator: 0.011815
 
 
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt
index c97c86f535..8fd682538f 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autotvm/tune_relay_arm.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_arm.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_arm.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt
index b676946b63..ff19e65ca9 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autotvm/tune_relay_cuda.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_cuda.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_cuda.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -85,7 +89,7 @@ Now return to python code. Import packages.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 79-84
+.. GENERATED FROM PYTHON SOURCE LINES 80-85
 
 Define Network
 --------------
 First we need to define the network in the Relay frontend API.
 We can load a pre-defined network from :code:`tvm.relay.testing`.
 We can also load models from MXNet, ONNX and TensorFlow.
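 
 For instance, a minimal sketch of loading a pre-defined ResNet-18 from
 :code:`tvm.relay.testing` (the tutorial's own helper in the generated code
 below wraps this and more):
 
 .. code-block:: default
 
     from tvm import relay
     import tvm.relay.testing
 
     # Returns the Relay IRModule and a dict of randomly initialized parameters.
     mod, params = relay.testing.resnet.get_workload(
         num_layers=18, batch_size=1, dtype="float32"
     )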
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-127
+.. GENERATED FROM PYTHON SOURCE LINES 85-128
 
 .. code-block:: default
 
@@ -147,13 +151,13 @@ We can also load models from MXNet, ONNX and TensorFlow.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 128-131
+.. GENERATED FROM PYTHON SOURCE LINES 129-132
 
 Set Tuning Options
 ------------------
 Before tuning, we apply some configurations.
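 
 As an illustration, the configuration amounts to a dictionary of tuning
 options along the following lines (a sketch; the concrete values are
 assumptions, not necessarily this tutorial's exact settings):
 
 .. code-block:: default
 
     from tvm import autotvm
 
     tuning_option = {
         "log_filename": "resnet-18.log",  # where tuning records are appended
         "tuner": "xgb",                   # cost-model-guided tuner
         "n_trial": 2000,                  # measurement trials per task
         "early_stopping": 600,            # stop a task early when no progress
         "measure_option": autotvm.measure_option(
             builder=autotvm.LocalBuilder(timeout=10),
             runner=autotvm.LocalRunner(
                 number=20, repeat=3, timeout=4, min_repeat_ms=150
             ),
         ),
     }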
 
-.. GENERATED FROM PYTHON SOURCE LINES 131-151
+.. GENERATED FROM PYTHON SOURCE LINES 132-152
 
 .. code-block:: default
 
@@ -191,7 +195,7 @@ Before tuning, we apply some configurations.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 152-162
+.. GENERATED FROM PYTHON SOURCE LINES 153-163
 
 .. note:: How to set tuning options
 
@@ -204,7 +208,7 @@ Before tuning, we apply some configurations.
   accelerate the tuning process. (see the 'Scale up measurement` section below).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 165-171
+.. GENERATED FROM PYTHON SOURCE LINES 166-172
 
 Begin Tuning
 ------------
@@ -213,7 +217,7 @@ Here, we provide a simple utility function to tune a list of tasks.
 This function is just an initial implementation that tunes tasks in sequential order.
 We will introduce a more sophisticated tuning scheduler in the future.
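 
 In outline, such a utility iterates over the extracted tasks and appends all
 measured records to one log file. A condensed sketch (the generated code
 below is the full version):
 
 .. code-block:: default
 
     from tvm import autotvm
     from tvm.autotvm.tuner import XGBTuner
 
     def tune_tasks(tasks, measure_option, n_trial=1000, log_filename="tuning.log"):
         for i, task in enumerate(tasks):
             prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
             # A cost-model-guided tuner; GA, random and grid tuners also work.
             tuner = XGBTuner(task, loss_type="rank")
             tuner.tune(
                 n_trial=min(n_trial, len(task.config_space)),
                 measure_option=measure_option,
                 callbacks=[
                     autotvm.callback.progress_bar(n_trial, prefix=prefix),
                     autotvm.callback.log_to_file(log_filename),
                 ],
             )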
 
-.. GENERATED FROM PYTHON SOURCE LINES 171-223
+.. GENERATED FROM PYTHON SOURCE LINES 172-224
 
 .. code-block:: default
 
@@ -276,11 +280,11 @@ We will introduce a more sophisticated tuning scheduler in the future.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 224-225
+.. GENERATED FROM PYTHON SOURCE LINES 225-226
 
 Finally, we launch tuning jobs and evaluate the end-to-end performance.
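 
 A condensed sketch of that step, reusing the :code:`tuning_option` and
 :code:`tune_tasks` sketches above (the generated code below is the full
 version):
 
 .. code-block:: default
 
     import numpy as np
     import tvm
     from tvm import relay, autotvm
     from tvm.contrib import graph_executor
 
     # Extract tunable conv2d tasks from the network and tune them.
     tasks = autotvm.task.extract_from_program(
         mod["main"], target="cuda", params=params, ops=(relay.op.get("nn.conv2d"),)
     )
     tune_tasks(tasks, tuning_option["measure_option"],
                log_filename=tuning_option["log_filename"])
 
     # Compile with the best records found, then benchmark end to end.
     with autotvm.apply_history_best(tuning_option["log_filename"]):
         with tvm.transform.PassContext(opt_level=3):
             lib = relay.build(mod, target="cuda", params=params)
 
     dev = tvm.device("cuda", 0)
     module = graph_executor.GraphModule(lib["default"](dev))
     module.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
     print(module.benchmark(dev, number=1, repeat=600))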
 
-.. GENERATED FROM PYTHON SOURCE LINES 225-261
+.. GENERATED FROM PYTHON SOURCE LINES 226-262
 
 .. code-block:: default
 
@@ -327,7 +331,7 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 262-295
+.. GENERATED FROM PYTHON SOURCE LINES 263-296
 
 Sample Output
 -------------
@@ -363,7 +367,7 @@ The tuning target is NVIDIA 1080 Ti.
 
 As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30ms. So we are a little faster.
 
-.. GENERATED FROM PYTHON SOURCE LINES 297-313
+.. GENERATED FROM PYTHON SOURCE LINES 298-314
 
 .. note:: **Experiencing Difficulties?**
 
@@ -382,11 +386,11 @@ As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30m
 
   Finally, always feel free to ask our community for help on https://discuss.tvm.apache.org
 
-.. GENERATED FROM PYTHON SOURCE LINES 316-317
+.. GENERATED FROM PYTHON SOURCE LINES 317-318
 
 .. _tutorials-autotvm-scale-up-rpc-tracker:
 
-.. GENERATED FROM PYTHON SOURCE LINES 319-372
+.. GENERATED FROM PYTHON SOURCE LINES 320-373
 
 Scale up measurement by using multiple devices
 ----------------------------------------------
@@ -442,7 +446,7 @@ For example, if we have four 1080ti, two titanx and one gfx900, the output can b
 Finally, we need to change the tuning option to use RPCRunner. Use the code below
 to replace the corresponding part above.
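 
 In outline, it swaps the local runner for :code:`autotvm.RPCRunner` (a sketch;
 the device key, host and port are placeholders that must match your own
 tracker setup):
 
 .. code-block:: default
 
     measure_option = autotvm.measure_option(
         builder=autotvm.LocalBuilder(timeout=10),
         # Request remote devices registered under the key "1080ti" from the
         # RPC tracker listening at 127.0.0.1:9190.
         runner=autotvm.RPCRunner(
             "1080ti", "127.0.0.1", 9190,
             number=20, repeat=3, timeout=4, min_repeat_ms=150,
         ),
     )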
 
-.. GENERATED FROM PYTHON SOURCE LINES 372-391
+.. GENERATED FROM PYTHON SOURCE LINES 373-392
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt
index e2603afca0..1bc903af55 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autotvm/tune_relay_mobile_gpu.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt
index bbb430b5bd..8567f5ca7b 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/tune_with_autotvm/tune_relay_x86.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_x86.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_tune_with_autotvm_tune_relay_x86.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/910e6ecee4ecac8d8ca0baeb6d00689d/tune_relay_x86.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt
index 48e18944e6..b6b9ba2956 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_aot.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_aot.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_aot.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -32,16 +36,55 @@ compared to GraphExecutor. Also, we can have better memory management using ahea
 of time compilation. This tutorial can be executed on an x86 CPU using the C runtime (CRT)
 or on the Zephyr platform, on a microcontroller/board supported by Zephyr.
 
-.. GENERATED FROM PYTHON SOURCE LINES 32-44
+.. GENERATED FROM PYTHON SOURCE LINES 34-36
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 37-47
 
 .. code-block:: default
 
 
 
+    import os
+
+    # By default, this tutorial runs on x86 CPU using TVM's C runtime. If you would like
+    # to run on real Zephyr hardware, you must export the `TVM_MICRO_USE_HW` environment
+    # variable. Otherwise (if you are using the C runtime), you can skip installing
+    # Zephyr and CMSIS-NN. It takes ~20 minutes to install both of them.
+    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
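+    # Note: os.getenv returns None when the variable is unset, and bool(None)
+    # is False, so exporting any non-empty value (e.g. TVM_MICRO_USE_HW=1)
+    # is enough to switch the tutorial onto the Zephyr hardware path.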
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 53-55
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_zephyr.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 58-60
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_cmsis.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 63-66
+
+Import Python dependencies
+-------------------------------
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 66-75
+
+.. code-block:: default
+
     import numpy as np
     import pathlib
     import json
-    import os
 
     import tvm
     from tvm import relay
@@ -55,7 +98,7 @@ or on Zephyr platform on a microcontroller/board supported by Zephyr.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 50-60
+.. GENERATED FROM PYTHON SOURCE LINES 76-86
 
 Import a TFLite model
 ---------------------
@@ -68,11 +111,10 @@ To test this model, we use samples from `KWS dataset provided by Google <https:/
 you need to export the ``TVM_MICRO_USE_HW`` environment variable.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 60-82
+.. GENERATED FROM PYTHON SOURCE LINES 86-107
 
 .. code-block:: default
 
-    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
     MODEL_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite"
     MODEL_PATH = download_testdata(MODEL_URL, "keyword_spotting_quant.tflite", module="model")
     SAMPLE_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy"
@@ -101,7 +143,7 @@ you need to export `TVM_MICRO_USE_HW` environment variable.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 83-93
+.. GENERATED FROM PYTHON SOURCE LINES 108-118
 
 Defining the target
 -------------------
@@ -114,7 +156,7 @@ board (E.g. nucleo_l4r5zi) and pass it to `tvm.target.target.micro` to create a
 micro target.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 93-112
+.. GENERATED FROM PYTHON SOURCE LINES 118-137
 
 .. code-block:: default
 
@@ -144,7 +186,7 @@ micro target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-118
+.. GENERATED FROM PYTHON SOURCE LINES 138-143
 
 Compile the model
 -----------------
@@ -152,7 +194,7 @@ Compile the model
 Now, we compile the model for the target:
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 118-123
+.. GENERATED FROM PYTHON SOURCE LINES 143-148
 
 .. code-block:: default
 
@@ -168,7 +210,7 @@ Now, we compile the model for the target:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-132
+.. GENERATED FROM PYTHON SOURCE LINES 149-157
 
 Create a microTVM project
 -------------------------
@@ -179,7 +221,7 @@ CRT and Zephyr microTVM template projects which are used for x86 CPU and Zephyr
 respectively.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 132-150
+.. GENERATED FROM PYTHON SOURCE LINES 157-177
 
 .. code-block:: default
 
@@ -193,6 +235,8 @@ respectively.
             "board": BOARD,
             "serial_number": SERIAL,
             "config_main_stack_size": 4096,
+            "cmsis_path": os.getenv("CMSIS_PATH", default="/content/cmsis"),
+            "zephyr_base": os.getenv("ZEPHYR_BASE", default="/content/zephyrproject/zephyr"),
         }
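+        # The cmsis_path and zephyr_base options above point the generated
+        # project at the CMSIS and Zephyr installs; the /content defaults
+        # assume a Google Colab setup, so set the CMSIS_PATH and ZEPHYR_BASE
+        # environment variables when running elsewhere.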
 
     temp_dir = tvm.contrib.utils.tempdir()
@@ -208,7 +252,7 @@ respectively.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 151-159
+.. GENERATED FROM PYTHON SOURCE LINES 178-186
 
 Build, flash and execute the model
 ----------------------------------
@@ -219,7 +263,7 @@ Next, we define the labels for the model output and execute the model with a
 sample with an expected value of 6 (label: left).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-183
+.. GENERATED FROM PYTHON SOURCE LINES 186-210
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
index 04aa5d5819..ddfb05e6d6 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_autotune.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_autotune.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_autotune.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/f83ba3df2d52f9b54cf141114359481a/micro_autotune.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -28,13 +32,50 @@ Autotuning with microTVM
 
 This tutorial explains how to autotune a model using the C runtime.
 
-.. GENERATED FROM PYTHON SOURCE LINES 29-41
+.. GENERATED FROM PYTHON SOURCE LINES 31-33
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 34-42
 
 .. code-block:: default
 
 
 
+    # You can skip the next two sections (installing Zephyr and CMSIS-NN) if the flag below is False.
+    # Installing Zephyr takes ~20 min.
     import os
+
+    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 48-50
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_zephyr.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 53-55
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_cmsis.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 58-61
+
+Import Python dependencies
+-------------------------------
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 61-68
+
+.. code-block:: default
+
     import json
     import numpy as np
     import pathlib
@@ -42,8 +83,6 @@ This tutorial explains how to autotune a model using the C runtime.
     import tvm
     from tvm.relay.backend import Runtime
 
-    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
-
 
 
 
@@ -51,7 +90,7 @@ This tutorial explains how to autotune a model using the C runtime.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 47-53
+.. GENERATED FROM PYTHON SOURCE LINES 69-75
 
 Defining the model
 ###################
@@ -60,7 +99,7 @@ Defining the model
  fill parameters with random numbers.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 53-78
+.. GENERATED FROM PYTHON SOURCE LINES 75-100
 
 .. code-block:: default
 
@@ -96,7 +135,7 @@ Defining the model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 79-90
+.. GENERATED FROM PYTHON SOURCE LINES 101-112
 
 Defining the target
 ######################
@@ -110,7 +149,7 @@ Defining the target
  this tutorial.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 90-108
+.. GENERATED FROM PYTHON SOURCE LINES 112-130
 
 .. code-block:: default
 
@@ -139,7 +178,7 @@ Defining the target
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 109-118
+.. GENERATED FROM PYTHON SOURCE LINES 131-140
 
 Extracting tuning tasks
 ########################
@@ -151,7 +190,7 @@ Extracting tuning tasks
  transformation passes; we'll apply the same configuration later on during autotuning.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 118-124
+.. GENERATED FROM PYTHON SOURCE LINES 140-146
 
 .. code-block:: default
 
@@ -168,7 +207,7 @@ Extracting tuning tasks
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-135
+.. GENERATED FROM PYTHON SOURCE LINES 147-157
 
 Configuring microTVM
 #####################
@@ -181,7 +220,7 @@ Configuring microTVM
  choose other options by choosing from the ``PLATFORM`` list.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 135-173
+.. GENERATED FROM PYTHON SOURCE LINES 157-195
 
 .. code-block:: default
 
@@ -230,14 +269,14 @@ Configuring microTVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 174-178
+.. GENERATED FROM PYTHON SOURCE LINES 196-200
 
 Run Autotuning
 #########################
  Now we can run autotuning separately on each extracted task on the microTVM device.
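
 A rough sketch of one such tuning loop is below (the tuner choice, trial
 count, and ``module_loader`` are illustrative placeholders, not necessarily
 this tutorial's exact values):

 .. code-block:: default

     import tvm
     from tvm import autotvm

     for task in tasks:
         tuner = autotvm.tuner.GATuner(task)
         tuner.tune(
             n_trial=10,  # schedule candidates to measure per task
             measure_option=autotvm.measure_option(
                 builder=autotvm.LocalBuilder(build_func=tvm.micro.autotvm_build_func),
                 runner=autotvm.LocalRunner(module_loader=module_loader),
             ),
             callbacks=[autotvm.callback.log_to_file("microtvm_autotune.log.json")],
         )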
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 178-196
+.. GENERATED FROM PYTHON SOURCE LINES 200-218
 
 .. code-block:: default
 
@@ -266,7 +305,7 @@ Run Autotuning
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 197-203
+.. GENERATED FROM PYTHON SOURCE LINES 219-225
 
 Timing the untuned program
 ###########################
@@ -275,7 +314,7 @@ Timing the untuned program
  the tuned operator.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 203-242
+.. GENERATED FROM PYTHON SOURCE LINES 225-264
 
 .. code-block:: default
 
@@ -329,21 +368,21 @@ Timing the untuned program
     ########## Build without Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  311.0     98.732   (1, 2, 10, 10, 3)  2       1        [311.0]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.021     0.959    (1, 6, 10, 10)     1       1        [3.021]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.972     0.309    (1, 1, 10, 10, 3)  1       1        [0.972]           
-    Total_time                                    -                                             314.993   -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  312.1     98.695   (1, 2, 10, 10, 3)  2       1        [312.1]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.148     0.995    (1, 6, 10, 10)     1       1        [3.148]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.978     0.309    (1, 1, 10, 10, 3)  1       1        [0.978]           
+    Total_time                                    -                                             316.226   -        -                  -       -        -                 
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 243-246
+.. GENERATED FROM PYTHON SOURCE LINES 265-268
 
 Timing the tuned program
 #########################
  Once autotuning completes, you can time execution of the entire program using the Debug Runtime:
 
-.. GENERATED FROM PYTHON SOURCE LINES 246-285
+.. GENERATED FROM PYTHON SOURCE LINES 268-307
 
 .. code-block:: default
 
@@ -397,10 +436,10 @@ Timing the tuned program
     ########## Build with Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  137.6     98.091   (1, 6, 10, 10, 1)  2       1        [137.6]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.821     1.298    (1, 6, 10, 10)     1       1        [1.821]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.858     0.611    (1, 3, 10, 10, 1)  1       1        [0.858]           
-    Total_time                                    -                                             140.278   -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  103.3     97.519   (1, 6, 10, 10, 1)  2       1        [103.3]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.775     1.676    (1, 6, 10, 10)     1       1        [1.775]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.853     0.805    (1, 3, 10, 10, 1)  1       1        [0.853]           
+    Total_time                                    -                                             105.928   -        -                  -       -        -                 
 
 
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt
index 067102e395..386a10d703 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_ethosu.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_ethosu.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_ethosu.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_pytorch.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_pytorch.rst.txt
index 147bffb92f..616f354c90 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_pytorch.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_pytorch.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_pytorch.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_pytorch.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_pytorch.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/09df7d9b9c90a2a1bdd570520693fd9f/micro_pytorch.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -31,14 +35,18 @@ a PyTorch model. This tutorial can be executed on a x86 CPU using C runtime (CRT
 **Note:** This tutorial only runs on an x86 CPU using CRT and does not run on Zephyr
 since the model would not fit on our current supported Zephyr boards.
 
-.. GENERATED FROM PYTHON SOURCE LINES 31-46
+.. GENERATED FROM PYTHON SOURCE LINES 33-35
+
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 36-50
 
 .. code-block:: default
 
 
 
     import pathlib
-
     import torch
     import torchvision
     from torchvision import transforms
@@ -57,7 +65,7 @@ since the model would not fit on our current supported Zephyr boards.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 52-58
+.. GENERATED FROM PYTHON SOURCE LINES 56-62
 
 Load a pre-trained PyTorch model
 --------------------------------
@@ -66,7 +74,7 @@ To begin with, load pre-trained MobileNetV2 from torchvision. Then,
 download a cat image and preprocess it to use as the model input.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 58-86
+.. GENERATED FROM PYTHON SOURCE LINES 62-90
 
 .. code-block:: default
 
@@ -109,7 +117,7 @@ download a cat image and preprocess it to use as the model input.
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/torch/ao/quantization/utils.py:281: UserWarning: must run observer before calling calculate_qparams. Returning default values.
       "must run observer before calling calculate_qparams. " +
     Downloading: "https://download.pytorch.org/models/quantized/mobilenet_v2_qnnpack_37f702c5.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2_qnnpack_37f702c5.pth
-
      0%|          | 0.00/3.42M [00:00<?, ?B/s]
    100%|##########| 3.42M/3.42M [00:00<00:00, 68.7MB/s]
+
      0%|          | 0.00/3.42M [00:00<?, ?B/s]
    100%|##########| 3.42M/3.42M [00:00<00:00, 153MB/s]
     /workspace/python/tvm/relay/frontend/pytorch_utils.py:47: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
       return LooseVersion(torch_ver) > ver
     /venv/apache-tvm-py3.7/lib/python3.7/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
@@ -118,7 +126,7 @@ download a cat image and preprocess it to use as the model input.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 87-96
+.. GENERATED FROM PYTHON SOURCE LINES 91-100
 
 Define Target, Runtime and Executor
 -----------------------------------
@@ -130,7 +138,7 @@ for C runtime which can run on a x86 CPU machine with the same flow that
 would run on a physical microcontroller.
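
 A minimal sketch of those three objects for the CRT host flow (the option
 values are illustrative, not necessarily this tutorial's exact choices):

 .. code-block:: default

     import tvm
     from tvm.relay.backend import Executor, Runtime

     target = tvm.target.target.micro("host")        # host-emulated micro target
     runtime = Runtime("crt", {"system-lib": True})  # TVM's C runtime
     executor = Executor("graph")                    # graph executor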
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 96-109
+.. GENERATED FROM PYTHON SOURCE LINES 100-113
 
 .. code-block:: default
 
@@ -154,7 +162,7 @@ would run on a physical microcontroller.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-115
+.. GENERATED FROM PYTHON SOURCE LINES 114-119
 
 Compile the model
 ------------------
@@ -162,7 +170,7 @@ Compile the model
 Now, we compile the model for the target:
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 115-124
+.. GENERATED FROM PYTHON SOURCE LINES 119-128
 
 .. code-block:: default
 
@@ -182,7 +190,7 @@ Now, we compile the model for the target:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-131
+.. GENERATED FROM PYTHON SOURCE LINES 129-135
 
 Create a microTVM project
 -------------------------
@@ -191,7 +199,7 @@ Now that we have the compiled model as an IRModule, we need to create a firmware
 to use the compiled model with microTVM. To do this, we use the Project API.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 131-143
+.. GENERATED FROM PYTHON SOURCE LINES 135-147
 
 .. code-block:: default
 
@@ -214,7 +222,7 @@ to use the compiled model with microTVM. To do this, we use Project API.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 144-150
+.. GENERATED FROM PYTHON SOURCE LINES 148-154
 
 Build, flash and execute the model
 ----------------------------------
@@ -223,7 +231,7 @@ physical microcontroller and it is skipped if it is simulating a microcontroller
 via the host ``main.cc`` or if a Zephyr emulated board is selected as the target.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 150-161
+.. GENERATED FROM PYTHON SOURCE LINES 154-165
 
 .. code-block:: default
 
@@ -245,14 +253,14 @@ via the host `main.cc`` or if a Zephyr emulated board is selected as the target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 162-166
+.. GENERATED FROM PYTHON SOURCE LINES 166-170
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 166-207
+.. GENERATED FROM PYTHON SOURCE LINES 170-211
 
 .. code-block:: default
 
@@ -314,7 +322,7 @@ Look up prediction top 1 index in 1000 class synset.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  2.920 seconds)
+   **Total running time of the script:** ( 1 minutes  1.972 seconds)
 
 
 .. _sphx_glr_download_how_to_work_with_microtvm_micro_pytorch.py:
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_reference_vm.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_reference_vm.rst.txt
index f5d99a228d..277d3dc055 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_reference_vm.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_reference_vm.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_reference_vm.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_reference_vm.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_reference_vm.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/7ef06253b3d2676eb50e20a5f81ef8f9/micro_reference_vm.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt
index cde4f85c34..4e5a7a2696 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_tflite.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_tflite.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_tflite.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -27,111 +31,52 @@ microTVM with TFLite Models
 This tutorial is an introduction to working with microTVM and a TFLite
 model with Relay.
 
-.. GENERATED FROM PYTHON SOURCE LINES 29-124
-
-.. note::
-    If you want to run this tutorial on the microTVM Reference VM, download the Jupyter
-    notebook using the link at the bottom of this page and save it into the TVM directory. Then:
-
-    #. Login to the reference VM with a modified ``vagrant ssh`` command:
-
-        ``$ vagrant ssh -- -L8888:localhost:8888``
-
-    #. Install jupyter:  ``pip install jupyterlab``
-    #. ``cd`` to the TVM directory.
-    #. Install tflite: poetry install -E importer-tflite
-    #. Launch Jupyter Notebook: ``jupyter notebook``
-    #. Copy the localhost URL displayed, and paste it into your browser.
-    #. Navigate to saved Jupyter Notebook (``.ipynb`` file).
-
-
-Setup
------
-
-Install TFLite
-^^^^^^^^^^^^^^
-
-To get started, TFLite package needs to be installed as prerequisite. You can do this in two ways:
-
-1. Install tflite with ``pip``
-
-    .. code-block:: bash
+.. GENERATED FROM PYTHON SOURCE LINES 29-31
 
-      pip install tflite=2.1.0 --user
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_dependencies.rst
 
-2. Generate the TFLite package yourself. The steps are the following:
 
-    Get the flatc compiler.
-    Please refer to https://github.com/google/flatbuffers for details
-    and make sure it is properly installed.
+.. GENERATED FROM PYTHON SOURCE LINES 32-42
 
-    .. code-block:: bash
-
-      flatc --version
-
-    Get the TFLite schema.
-
-    .. code-block:: bash
+.. code-block:: default
 
-      wget https://raw.githubusercontent.com/tensorflow/tensorflow/r1.13/tensorflow/lite/schema/schema.fbs
 
-    Generate TFLite package.
 
-    .. code-block:: bash
+    import os
 
-      flatc --python schema.fbs
+    # By default, this tutorial runs on x86 CPU using TVM's C runtime. If you would like
+    # to run on real Zephyr hardware, you must export the `TVM_MICRO_USE_HW` environment
+    # variable. Otherwise (if you are using the C runtime), you can skip installing
+    # Zephyr and CMSIS-NN. It takes ~20 minutes to install both of them.
+    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
 
-    Add the current folder (which contains generated tflite module) to PYTHONPATH.
 
-    .. code-block:: bash
 
-      export PYTHONPATH=${PYTHONPATH:+$PYTHONPATH:}$(pwd)
 
-To validate that the TFLite package was installed successfully, ``python -c "import tflite"``
 
-Install Zephyr (physical hardware only)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-When running this tutorial with a host simulation (the default), you can use the host ``gcc`` to
-build a firmware image that simulates the device. When compiling to run on physical hardware, you
-need to install a *toolchain* plus some target-specific dependencies. microTVM allows you to
-supply any compiler and runtime that can launch the TVM RPC server, but to get started, this
-tutorial relies on the Zephyr RTOS to provide these pieces.
 
-You can install Zephyr by following the
-`Installation Instructions <https://docs.zephyrproject.org/latest/getting_started/index.html>`_.
 
-Aside: Recreating your own Pre-Trained TFLite model
- The tutorial downloads a pretrained TFLite model. When working with microcontrollers
- you need to be mindful these are highly resource constrained devices as such standard
- models like MobileNet may not fit into their modest memory.
+.. GENERATED FROM PYTHON SOURCE LINES 48-50
 
- For this tutorial, we'll make use of one of the TF Micro example models.
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_zephyr.rst
 
- If you wish to replicate the training steps see:
- https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/hello_world/train
 
-   .. note::
+.. GENERATED FROM PYTHON SOURCE LINES 53-55
 
-     If you accidentally download the example pretrained model from:
+.. include:: ../../../../gallery/how_to/work_with_microtvm/install_cmsis.rst
 
-     ``wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/micro/hello_world_2020_04_13.zip``
 
-     this will fail due to an unimplemented opcode (114)
+.. GENERATED FROM PYTHON SOURCE LINES 58-61
 
-Load and prepare the Pre-Trained Model
---------------------------------------
+Import Python dependencies
+-------------------------------
 
-Load the pretrained TFLite model from a file in your current
-directory into a buffer
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-145
+.. GENERATED FROM PYTHON SOURCE LINES 61-78
 
 .. code-block:: default
 
-
-
-    import os
     import json
     import tarfile
     import pathlib
@@ -143,7 +88,6 @@ directory into a buffer
     import tvm.contrib.utils
     from tvm.contrib.download import download_testdata
 
-    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
     model_url = "https://people.linaro.org/~tom.gall/sine_model.tflite"
     model_file = "sine_model.tflite"
     model_path = download_testdata(model_url, model_file, module="data")
@@ -157,11 +101,11 @@ directory into a buffer
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 151-152
+.. GENERATED FROM PYTHON SOURCE LINES 79-80
 
 Using the buffer, transform into a tflite model python object
 
-.. GENERATED FROM PYTHON SOURCE LINES 152-161
+.. GENERATED FROM PYTHON SOURCE LINES 80-89
 
 .. code-block:: default
 
@@ -181,11 +125,11 @@ Using the buffer, transform into a tflite model python object
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 162-163
+.. GENERATED FROM PYTHON SOURCE LINES 90-91
 
 Print out the version of the model
 
-.. GENERATED FROM PYTHON SOURCE LINES 163-166
+.. GENERATED FROM PYTHON SOURCE LINES 91-94
 
 .. code-block:: default
 
@@ -205,7 +149,7 @@ Print out the version of the model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 167-175
+.. GENERATED FROM PYTHON SOURCE LINES 95-103
 
 Parse the Python model object to convert it into a Relay module
 and weights.
@@ -216,7 +160,7 @@ If you are unsure what that might be, this can be discovered by using
 the ``visualize.py`` script within the TensorFlow project.
 See `How do I inspect a .tflite file? <https://www.tensorflow.org/lite/guide/faq>`_
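
 A sketch of the conversion call itself (the input name ``dense_4_input`` and
 shape ``(1, 1)`` are assumptions about this particular sine model):

 .. code-block:: default

     from tvm import relay

     mod, params = relay.frontend.from_tflite(
         tflite_model,
         shape_dict={"dense_4_input": (1, 1)},
         dtype_dict={"dense_4_input": "float32"},
     )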
 
-.. GENERATED FROM PYTHON SOURCE LINES 175-184
+.. GENERATED FROM PYTHON SOURCE LINES 103-112
 
 .. code-block:: default
 
@@ -236,7 +180,7 @@ See `How do I inspect a .tflite file? <https://www.tensorflow.org/lite/guide/faq
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 185-194
+.. GENERATED FROM PYTHON SOURCE LINES 113-122
 
 Defining the target
 -------------------
@@ -248,7 +192,7 @@ TARGET, the C Runtime as the RUNTIME and a proper board/VM to run it (Zephyr wil
 QEMU VM based on BOARD). In the example below, the x86 arch is selected and an x86 VM is picked up accordingly:
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 194-225
+.. GENERATED FROM PYTHON SOURCE LINES 122-152
 
 .. code-block:: default
 
@@ -268,8 +212,7 @@ QEMU VM based on BOARD. In the example below the x86 arch is selected and a x86
         boards_file = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr")) / "boards.json"
         with open(boards_file) as f:
             boards = json.load(f)
-
-        BOARD = os.getenv("TVM_MICRO_BOARD", default="nucleo_f746zg")
+        BOARD = os.getenv("TVM_MICRO_BOARD", default="nucleo_l4r5zi")
         SERIAL = os.getenv("TVM_MICRO_SERIAL", default=None)
         TARGET = tvm.target.target.micro(boards[BOARD]["model"])
 
@@ -290,11 +233,11 @@ QEMU VM based on BOARD. In the example below the x86 arch is selected and a x86
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 226-227
+.. GENERATED FROM PYTHON SOURCE LINES 153-154
 
 Now, compile the model for the target:
 
-.. GENERATED FROM PYTHON SOURCE LINES 227-310
+.. GENERATED FROM PYTHON SOURCE LINES 154-244
 
 .. code-block:: default
 
@@ -366,7 +309,14 @@ Now, compile the model for the target:
 
     if use_physical_hw:
         template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr"))
-        project_options = {"project_type": "host_driven", "board": BOARD, "serial_number": SERIAL}
+        project_options = {
+            "project_type": "host_driven",
+            "board": BOARD,
+            "serial_number": SERIAL,
+            "config_main_stack_size": 4096,
+            "cmsis_path": os.getenv("CMSIS_PATH", default="/content/cmsis"),
+            "zephyr_base": os.getenv("ZEPHYR_BASE", default="/content/zephyrproject/zephyr"),
+        }
 
     # Create a temporary directory
 
@@ -418,14 +368,14 @@ Now, compile the model for the target:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 311-315
+.. GENERATED FROM PYTHON SOURCE LINES 245-249
 
 Next, establish a session with the simulated device and run the
 computation. The ``with session`` line would typically flash an attached
 microcontroller, but in this tutorial, it simply launches a subprocess
 to stand in for an attached microcontroller.
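
 A sketch of that session usage (``module``, ``project`` and ``input_name``
 stand in for objects created earlier in the tutorial; the one-element input
 mimics the sine model's ``(1, 1)`` input):

 .. code-block:: default

     import numpy as np
     import tvm

     with tvm.micro.Session(project.transport()) as session:
         graph_mod = tvm.micro.create_local_graph_executor(
             module.get_graph_json(), session.get_system_lib(), session.device
         )
         graph_mod.set_input(input_name, tvm.nd.array(np.array([[0.5]], dtype="float32")))
         graph_mod.run()
         result = graph_mod.get_output(0).numpy()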
 
-.. GENERATED FROM PYTHON SOURCE LINES 315-332
+.. GENERATED FROM PYTHON SOURCE LINES 249-266
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
index f72d49d8df..9c91db6b9b 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_train.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_train.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_train.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -28,18 +32,7 @@ This tutorial shows how MobileNetV1 models can be trained
 to fit on embedded devices, and how those models can be
 deployed to Arduino using TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 30-93
-
-.. note::
-
-  This tutorial is best viewed as a Jupyter Notebook. You can download and run it locally
-  using the link at the bottom of this page, or open it online for free using Google Colab.
-  Click the icon below to open in Google Colab.
-
-.. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.png
-     :align: center
-     :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/a7c7ea4b5017ae70db1f51dd8e6dcd82/micro_train.ipynb
-     :width: 300px
+.. GENERATED FROM PYTHON SOURCE LINES 30-82
 
 Motivation
 ----------
@@ -74,7 +67,7 @@ install ``imagemagick`` and ``curl`` to preprocess data:
 
     .. code-block:: bash
 
-      %%bash
+      %%shell
       pip install -q tensorflow tflite
       pip install -q tlcpack-nightly -f https://tlcpack.ai/wheels
       apt-get -qq install imagemagick curl
@@ -94,7 +87,7 @@ accelerator. If you are running locally, you can `follow TensorFlow's guide <htt
 
 We can test our GPU installation with the following code:
 
-.. GENERATED FROM PYTHON SOURCE LINES 93-102
+.. GENERATED FROM PYTHON SOURCE LINES 82-91
 
 .. code-block:: default
 
@@ -121,7 +114,7 @@ We can test our GPU installation with the following code:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 103-109
+.. GENERATED FROM PYTHON SOURCE LINES 92-98
 
 Choosing Our Work Dir
 ^^^^^^^^^^^^^^^^^^^^^
@@ -130,7 +123,7 @@ will all live. If running on Google Colab, we'll save everything in ``/root`` (a
 probably want to store it elsewhere if running locally. Note that this variable only affects Python
 scripts - you'll have to adjust the Bash commands too.
 
-.. GENERATED FROM PYTHON SOURCE LINES 109-114
+.. GENERATED FROM PYTHON SOURCE LINES 98-103
 
 .. code-block:: default
 
@@ -146,7 +139,7 @@ scripts - you'll have to adjust the Bash commands too.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 120-162
+.. GENERATED FROM PYTHON SOURCE LINES 109-151
 
 Downloading the Data
 --------------------
@@ -191,7 +184,7 @@ during training to correct for this, but training will still work if we ignore i
 take about **2 minutes** to download the Stanford Cars, while COCO 2017 validation will take
 **1 minute**.
 
-.. GENERATED FROM PYTHON SOURCE LINES 162-183
+.. GENERATED FROM PYTHON SOURCE LINES 151-172
 
 .. code-block:: default
 
@@ -225,11 +218,11 @@ take about **2 minutes** to download the Stanford Cars, while COCO 2017 validati
  .. code-block:: none
 
 
-    '/tmp/tmpiciibvn0/images/random'
+    '/tmp/tmpa6k21yod/images/random'
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 184-204
+.. GENERATED FROM PYTHON SOURCE LINES 173-193
 
 Loading the Data
 ----------------
@@ -252,7 +245,7 @@ Lastly, in machine learning we generally want our inputs to be small numbers. We
 instead of ``0`` to ``255``. We need to be careful not to rescale our categorical labels though, so
 we'll use a ``lambda`` function.
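
 A sketch of that step (assuming ``tf.data`` datasets of ``(image, label)``
 pairs named ``train_dataset``):

 .. code-block:: default

     import tensorflow as tf

     rescale = tf.keras.layers.Rescaling(1.0 / 255)
     train_dataset = train_dataset.map(lambda image, label: (rescale(image), label))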
 
-.. GENERATED FROM PYTHON SOURCE LINES 204-216
+.. GENERATED FROM PYTHON SOURCE LINES 193-205
 
 .. code-block:: default
 
@@ -281,7 +274,7 @@ we'll use a ``lambda`` function.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 217-222
+.. GENERATED FROM PYTHON SOURCE LINES 206-211
 
 What's Inside Our Dataset?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -289,7 +282,7 @@ Before giving this data set to our neural network, we ought to give it a quick v
 Does the data look properly transformed? Do the labels seem appropriate? And what's our ratio of
 objects to other stuff? We can display some examples from our datasets using ``matplotlib``:
 
-.. GENERATED FROM PYTHON SOURCE LINES 222-241
+.. GENERATED FROM PYTHON SOURCE LINES 211-230
 
 .. code-block:: default
 
@@ -316,7 +309,7 @@ objects to other stuff? We can display some examples from our datasets using ``m
 
 
 .. image-sg:: /how_to/work_with_microtvm/images/sphx_glr_micro_train_001.png
-   :alt: [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]
+   :alt: [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]
    :srcset: /how_to/work_with_microtvm/images/sphx_glr_micro_train_001.png
    :class: sphx-glr-single-img
 
@@ -325,13 +318,13 @@ objects to other stuff? We can display some examples from our datasets using ``m
 
  .. code-block:: none
 
-    /tmp/tmpiciibvn0/images/target contains 8144 images
-    /tmp/tmpiciibvn0/images/random contains 5000 images
+    /tmp/tmpa6k21yod/images/target contains 8144 images
+    /tmp/tmpa6k21yod/images/random contains 5000 images
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 242-252
+.. GENERATED FROM PYTHON SOURCE LINES 231-241
 
 Validating our Accuracy
 ^^^^^^^^^^^^^^^^^^^^^^^
@@ -344,7 +337,7 @@ reality. In practice, this "memorizing" is called **overfitting**.
 To prevent this, we will set aside some of the data (we'll use 20%) as a **validation set**. Our
 model will never be trained on validation data - we'll only use it to check our model's accuracy.
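
 One common way to carve out that 20% is shown below (a sketch using a Keras
 utility; the directory name and seed are placeholders):

 .. code-block:: default

     import tensorflow as tf

     train_ds = tf.keras.utils.image_dataset_from_directory(
         "images", validation_split=0.2, subset="training", seed=42
     )
     val_ds = tf.keras.utils.image_dataset_from_directory(
         "images", validation_split=0.2, subset="validation", seed=42
     )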
 
-.. GENERATED FROM PYTHON SOURCE LINES 252-257
+.. GENERATED FROM PYTHON SOURCE LINES 241-246
 
 .. code-block:: default
 
@@ -360,7 +353,7 @@ model will never be trained on validation data - we'll only use it to check our
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 258-305
+.. GENERATED FROM PYTHON SOURCE LINES 247-294
 
 Loading the Data
 ----------------
@@ -410,7 +403,7 @@ model is called *fine-tuning*.
 Source MobileNets for transfer learning have been `pretrained by the TensorFlow folks <https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md>`_, so we
 can just download the one closest to what we want (the 128x128 input model with 0.25 depth scale).
 
-.. GENERATED FROM PYTHON SOURCE LINES 305-317
+.. GENERATED FROM PYTHON SOURCE LINES 294-306
 
 .. code-block:: default
 
@@ -433,7 +426,7 @@ can just download the one closest to what we want (the 128x128 input model with
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 318-324
+.. GENERATED FROM PYTHON SOURCE LINES 307-313
 
 Modifying Our Network
 ^^^^^^^^^^^^^^^^^^^^^
@@ -442,7 +435,7 @@ but we want to convert it to classify cars. Since only the bottom few layers are
 we'll **cut off the last five layers** of our original model. In their place we'll build our own
 "tail" to the model by performing reshape, dropout, flatten, and softmax operations.
 
-.. GENERATED FROM PYTHON SOURCE LINES 324-335
+.. GENERATED FROM PYTHON SOURCE LINES 313-324
 
 .. code-block:: default
 
@@ -464,7 +457,7 @@ we'll **cut off the last five layers** of our original model. In their place we'
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 336-349
+.. GENERATED FROM PYTHON SOURCE LINES 325-338
 
 Fine Tuning Our Network
 ^^^^^^^^^^^^^^^^^^^^^^^
@@ -480,7 +473,7 @@ model is each time we train it, and let us track how our model is improving. Onc
 finished, the model should have a validation accuracy around ``0.98`` (meaning it was right 98% of
 the time on our validation set).
 
-.. GENERATED FROM PYTHON SOURCE LINES 349-357
+.. GENERATED FROM PYTHON SOURCE LINES 338-346
 
 .. code-block:: default
 
@@ -501,17 +494,17 @@ the time on our validation set).
  .. code-block:: none
 
     Epoch 1/3
-    328/328 - 47s - loss: 0.2218 - accuracy: 0.9270 - val_loss: 0.1619 - val_accuracy: 0.9475 - 47s/epoch - 143ms/step
+    328/328 - 47s - loss: 0.2259 - accuracy: 0.9216 - val_loss: 0.1162 - val_accuracy: 0.9562 - 47s/epoch - 142ms/step
     Epoch 2/3
-    328/328 - 43s - loss: 0.0939 - accuracy: 0.9644 - val_loss: 0.1150 - val_accuracy: 0.9615 - 43s/epoch - 132ms/step
+    328/328 - 43s - loss: 0.0972 - accuracy: 0.9643 - val_loss: 0.1257 - val_accuracy: 0.9600 - 43s/epoch - 130ms/step
     Epoch 3/3
-    328/328 - 43s - loss: 0.0621 - accuracy: 0.9759 - val_loss: 0.1155 - val_accuracy: 0.9596 - 43s/epoch - 132ms/step
+    328/328 - 43s - loss: 0.0628 - accuracy: 0.9772 - val_loss: 0.1478 - val_accuracy: 0.9517 - 43s/epoch - 131ms/step
 
-    <keras.callbacks.History object at 0x7f36bcfdcad0>
+    <keras.callbacks.History object at 0x7f243f09af10>
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 358-379
+.. GENERATED FROM PYTHON SOURCE LINES 347-368
 
 Quantization
 ------------
@@ -535,7 +528,7 @@ that is used for tracking how those neurons activate. We'll then pass this into
 the conversion. By default, TFLite keeps the inputs and outputs of our model as floats, so we must
 explicitly tell it to avoid this behavior.
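
 A sketch of such a conversion (``model`` and ``representative_dataset`` are
 placeholders for the trained Keras model and the calibration generator):

 .. code-block:: default

     import tensorflow as tf

     converter = tf.lite.TFLiteConverter.from_keras_model(model)
     converter.optimizations = [tf.lite.Optimize.DEFAULT]
     converter.representative_dataset = representative_dataset
     # Force integer inputs/outputs instead of the default float32:
     converter.inference_input_type = tf.uint8
     converter.inference_output_type = tf.uint8
     quantized_model = converter.convert()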
 
-.. GENERATED FROM PYTHON SOURCE LINES 379-395
+.. GENERATED FROM PYTHON SOURCE LINES 368-384
 
 .. code-block:: default
 
@@ -569,7 +562,7 @@ explicitly tell it to avoid this behavior.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 396-403
+.. GENERATED FROM PYTHON SOURCE LINES 385-392
 
 Download the Model if Desired
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -579,7 +572,7 @@ those things, we'll have to write it to a file (``quantized.tflite``). If you're
 tutorial on Google Colab, you'll have to uncomment the last two lines to download the file
 after writing it.
 
-.. GENERATED FROM PYTHON SOURCE LINES 403-410
+.. GENERATED FROM PYTHON SOURCE LINES 392-399
 
 .. code-block:: default
 
@@ -597,7 +590,7 @@ after writing it.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 411-451
+.. GENERATED FROM PYTHON SOURCE LINES 400-440
 
 Compiling With TVM For Arduino
 ------------------------------
@@ -640,7 +633,7 @@ Once we have set these configuration parameters, we will call ``tvm.relay.build`
 Relay model into the MLF intermediate representation. From here, we just need to call
 ``tvm.micro.generate_project`` and pass in the Arduino template project to finish compilation.
 
-.. GENERATED FROM PYTHON SOURCE LINES 451-487
+.. GENERATED FROM PYTHON SOURCE LINES 440-476
 
 .. code-block:: default
 
@@ -687,7 +680,7 @@ Relay model into the MLF intermediate representation. From here, we just need to
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 488-529
+.. GENERATED FROM PYTHON SOURCE LINES 477-518
 
 Testing our Arduino Project
 ---------------------------
@@ -719,7 +712,7 @@ We can do both of these things with a few lines of Bash code:
 
     .. code-block:: bash
 
-      %%bash
+      %%shell
       mkdir -p ~/tests
       curl "https://i.imgur.com/JBbEhxN.png" -o ~/tests/car_224.png
       convert ~/tests/car_224.png -resize 64 ~/tests/car_64.png
@@ -731,7 +724,7 @@ We can do both of these things with a few lines of Bash code:
       stream ~/tests/catan_64.png ~/tests/catan.raw
       bin2c -c -st ~/tests/catan.raw --name CATAN_IMAGE > ~/models/project/catan.c
 
-.. GENERATED FROM PYTHON SOURCE LINES 531-571
+.. GENERATED FROM PYTHON SOURCE LINES 520-560
 
 Writing our Arduino Script
 --------------------------
@@ -774,7 +767,7 @@ compile and flash commands underneath. We could also begin autotuning our model,
 subject for a different tutorial. To finish up, we'll verify no compiler errors are thrown
 by our project:
 
-.. GENERATED FROM PYTHON SOURCE LINES 571-576
+.. GENERATED FROM PYTHON SOURCE LINES 560-565
 
 .. code-block:: default
 
@@ -796,7 +789,7 @@ by our project:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 582-589
+.. GENERATED FROM PYTHON SOURCE LINES 571-578
 
 Uploading to Our Device
 -----------------------
@@ -806,7 +799,7 @@ simple enough to do - we'll just turn our project into a `.zip` archive, and cal
 If you're running on Google Colab, you'll have to uncomment the last two lines to download the file
 after writing it.
 
-.. GENERATED FROM PYTHON SOURCE LINES 589-596
+.. GENERATED FROM PYTHON SOURCE LINES 578-585
 
 .. code-block:: default
 
@@ -824,7 +817,7 @@ after writing it.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 617-651
+.. GENERATED FROM PYTHON SOURCE LINES 606-640
 
 From here, we'll need to open it in the Arduino IDE. You'll have to download the IDE as well as
 the SDK for whichever board you are using. For certain boards like the Sony SPRESENSE, you may
@@ -864,7 +857,7 @@ Arduino tutorial for how to do that `on GitHub <https://github.com/guberti/tvm-a
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 4 minutes  22.904 seconds)
+   **Total running time of the script:** ( 4 minutes  12.686 seconds)
 
 
 .. _sphx_glr_download_how_to_work_with_microtvm_micro_train.py:
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_tvmc.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_tvmc.rst.txt
index d31fd038d0..9f4c027303 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_tvmc.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_tvmc.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_microtvm/micro_tvmc.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_tvmc.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_tvmc.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/6e511f5a8ddbf12f2fca2dfadc0cc4a9/micro_tvmc.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
diff --git a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
index 84b8c47739..771c4bc60f 100644
--- a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
@@ -5,18 +5,18 @@
 
 Computation times
 =================
-**06:29.049** total execution time for **how_to_work_with_microtvm** files:
+**06:16.992** total execution time for **how_to_work_with_microtvm** files:
 
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)               | 04:22.904 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)               | 04:12.686 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_pytorch.py` (``micro_pytorch.py``)           | 01:02.920 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_pytorch.py` (``micro_pytorch.py``)           | 01:01.972 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)         | 00:51.437 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)         | 00:50.713 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_aot.py` (``micro_aot.py``)                   | 00:07.978 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_aot.py` (``micro_aot.py``)                   | 00:07.872 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)             | 00:03.808 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)             | 00:03.747 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_microtvm_micro_reference_vm.py` (``micro_reference_vm.py``) | 00:00.001 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_relay/build_gcn.rst.txt b/docs/_sources/how_to/work_with_relay/build_gcn.rst.txt
index d3a76e38e1..f456bcdfe1 100644
--- a/docs/_sources/how_to/work_with_relay/build_gcn.rst.txt
+++ b/docs/_sources/how_to/work_with_relay/build_gcn.rst.txt
@@ -1,17 +1,21 @@
 
-.. DO NOT EDIT.
-.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
-.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
+.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
+.. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "how_to/work_with_relay/build_gcn.py"
-.. LINE NUMBERS ARE GIVEN BELOW.
 
 .. only:: html
 
     .. note::
         :class: sphx-glr-download-link-note
 
-        Click :ref:`here <sphx_glr_download_how_to_work_with_relay_build_gcn.py>`
-        to download the full example code
+        This tutorial can be used interactively with Google Colab! You can also click
+        :ref:`here <sphx_glr_download_how_to_work_with_relay_build_gcn.py>` to run the Jupyter notebook locally.
+
+        .. image:: https://raw.githubusercontent.com/apache/web-data/main/images/utilities/colab_button.svg
+            :align: center
+            :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
+            :width: 300px
 
 .. rst-class:: sphx-glr-example-title
 
@@ -27,13 +31,19 @@ In this tutorial, we will run our GCN on Cora dataset to demonstrate.
 The Cora dataset is a common benchmark for Graph Neural Networks (GNN) and frameworks that support GNN training and inference.
 We load the dataset directly from the DGL library to do an apples-to-apples comparison against DGL.
 
-Please refer to DGL doc for DGL installation at
+.. code-block:: bash
+
+    %%shell
+    pip install torch==1.9.0
+    pip install dgl==v0.7.2 -f https://data.dgl.ai/wheels/repo.html
+
+Please refer to DGL doc for installation at
 https://docs.dgl.ai/install/index.html.
 
 Please refer to PyTorch guide for PyTorch installation at
 https://pytorch.org/get-started/locally/.
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-42
+.. GENERATED FROM PYTHON SOURCE LINES 43-48
 
 Define GCN in DGL with PyTorch backend
 --------------------------------------
@@ -41,7 +51,7 @@ Define GCN in DGL with PyTorch backend
 DGL example: https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcn
 This part reuses the code from the above example.
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-71
+.. GENERATED FROM PYTHON SOURCE LINES 48-77
 
 .. code-block:: default
 
@@ -99,13 +109,13 @@ This part reuses the code from the above example.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 72-75
+.. GENERATED FROM PYTHON SOURCE LINES 78-81
 
 Define the functions to load dataset and evaluate accuracy
 ----------------------------------------------------------
 You may substitute this part with your own dataset; here we load data from DGL
 
-.. GENERATED FROM PYTHON SOURCE LINES 75-100
+.. GENERATED FROM PYTHON SOURCE LINES 81-106
 
 .. code-block:: default
 
@@ -141,12 +151,12 @@ You may substitute this part with your own dataset, here we load data from DGL
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-103
+.. GENERATED FROM PYTHON SOURCE LINES 107-109
 
 Load the data and set up model parameters
 -----------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 103-130
+.. GENERATED FROM PYTHON SOURCE LINES 109-136
 
 .. code-block:: default
 
@@ -206,14 +216,14 @@ Load the data and set up model parameters
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-140
+.. GENERATED FROM PYTHON SOURCE LINES 142-146
 
 Set up the DGL-PyTorch model and get the golden results
 -------------------------------------------------------
 
 The weights are trained with https://github.com/dmlc/dgl/blob/master/examples/pytorch/gcn/train.py
 
-.. GENERATED FROM PYTHON SOURCE LINES 140-156
+.. GENERATED FROM PYTHON SOURCE LINES 146-162
 
 .. code-block:: default
 
@@ -250,12 +260,12 @@ The weights are trained with https://github.com/dmlc/dgl/blob/master/examples/py
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-159
+.. GENERATED FROM PYTHON SOURCE LINES 163-165
 
 Run the DGL model and test for accuracy
 ---------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-168
+.. GENERATED FROM PYTHON SOURCE LINES 165-174
 
 .. code-block:: default
 
@@ -291,7 +301,7 @@ Run the DGL model and test for accuracy
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 169-182
+.. GENERATED FROM PYTHON SOURCE LINES 175-188
 
 Define Graph Convolution Layer in Relay
 ---------------------------------------
@@ -307,7 +317,7 @@ this method is temporary and will be updated in next few weeks when we have spar
                                        = ((H * W)^t * A^t)^t
                                        = ((W^t * H^t) * A^t)^t
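
 A quick numeric check of that transpose identity (a standalone sketch, not
 part of the tutorial's code):

 .. code-block:: default

     import numpy as np

     A = np.random.rand(4, 4)  # adjacency matrix
     H = np.random.rand(4, 3)  # node features
     W = np.random.rand(3, 2)  # layer weights
     assert np.allclose(A @ H @ W, ((W.T @ H.T) @ A.T).T)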
 
-.. GENERATED FROM PYTHON SOURCE LINES 182-240
+.. GENERATED FROM PYTHON SOURCE LINES 188-246
 
 .. code-block:: default
 
@@ -376,13 +386,13 @@ this method is temporary and will be updated in next few weeks when we have spar
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 241-244
+.. GENERATED FROM PYTHON SOURCE LINES 247-250
... 10300 lines suppressed ...