Posted to commits@tvm.apache.org by tq...@apache.org on 2022/04/15 19:19:11 UTC

[tvm-site] branch asf-site updated: deploying docs (apache/tvm@8bfe3bbb3cc221a8e5d1063f72c1c193c6af5bd9)

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 1fe5af359 deploying docs (apache/tvm@8bfe3bbb3cc221a8e5d1063f72c1c193c6af5bd9)
1fe5af359 is described below

commit 1fe5af359a632f3f85336697a4980a3e37080d8d
Author: tvm-bot <95...@users.noreply.github.com>
AuthorDate: Fri Apr 15 19:19:05 2022 +0000

    deploying docs (apache/tvm@8bfe3bbb3cc221a8e5d1063f72c1c193c6af5bd9)
---
 .../how_to/compile_models/from_darknet.rst.txt     |   5 -
 .../how_to/compile_models/from_mxnet.rst.txt       |   2 +-
 .../how_to/compile_models/from_paddle.rst.txt      |   2 +-
 .../how_to/compile_models/from_pytorch.rst.txt     |   2 +-
 .../how_to/compile_models/from_tensorflow.rst.txt  |   5 -
 .../compile_models/sg_execution_times.rst.txt      |  20 +-
 .../deploy_models/deploy_model_on_android.rst.txt  |   2 +-
 .../deploy_object_detection_pytorch.rst.txt        |   4 +-
 .../deploy_models/deploy_prequantized.rst.txt      |   6 +-
 .../deploy_prequantized_tflite.rst.txt             |   4 +-
 .../how_to/deploy_models/deploy_quantized.rst.txt  |   2 +-
 .../deploy_models/deploy_ssd_gluoncv.rst.txt       |   4 +-
 .../deploy_models/sg_execution_times.rst.txt       |  18 +-
 .../extend_tvm/bring_your_own_datatypes.rst.txt    |   2 +-
 .../how_to/extend_tvm/sg_execution_times.rst.txt   |  10 +-
 .../how_to/extend_tvm/use_pass_instrument.rst.txt  |  16 +-
 .../optimize_operators/opt_conv_cuda.rst.txt       |   2 +-
 .../optimize_operators/opt_conv_tensorcore.rst.txt |   2 +-
 .../how_to/optimize_operators/opt_gemm.rst.txt     |  16 +-
 .../optimize_operators/sg_execution_times.rst.txt  |   8 +-
 .../sg_execution_times.rst.txt                     |  16 +-
 .../tune_conv2d_layer_cuda.rst.txt                 | 291 ++++++++-------------
 .../tune_network_cuda.rst.txt                      |   2 +-
 .../tune_network_x86.rst.txt                       |   4 +-
 .../tune_sparse_x86.rst.txt                        |  86 ++----
 .../tune_with_autotvm/sg_execution_times.rst.txt   |  12 +-
 .../tune_with_autotvm/tune_conv2d_cuda.rst.txt     |  34 +--
 .../work_with_microtvm/micro_autotune.rst.txt      |  16 +-
 .../work_with_microtvm/sg_execution_times.rst.txt  |  10 +-
 .../work_with_relay/sg_execution_times.rst.txt     |   8 +-
 .../work_with_schedules/sg_execution_times.rst.txt |  18 +-
 .../how_to/work_with_schedules/tensorize.rst.txt   |   2 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |   6 +-
 .../frontend/deploy_classification.rst.txt         |   2 +-
 .../tutorials/frontend/deploy_detection.rst.txt    |   2 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |   6 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |   6 +-
 .../topic/vta/tutorials/sg_execution_times.rst.txt |   6 +-
 .../tutorial/auto_scheduler_matmul_x86.rst.txt     |   7 +-
 docs/_sources/tutorial/autotvm_relay_x86.rst.txt   |  68 +++--
 .../tutorial/cross_compilation_and_rpc.rst.txt     |   2 +-
 docs/_sources/tutorial/intro_topi.rst.txt          |   2 +-
 docs/_sources/tutorial/sg_execution_times.rst.txt  |  26 +-
 .../tutorial/tensor_expr_get_started.rst.txt       |  47 ++--
 docs/commit_hash                                   |   2 +-
 docs/how_to/compile_models/from_darknet.html       |   1 -
 docs/how_to/compile_models/from_mxnet.html         |   2 +-
 docs/how_to/compile_models/from_paddle.html        |   2 +-
 docs/how_to/compile_models/from_pytorch.html       |   7 +-
 docs/how_to/compile_models/from_tensorflow.html    |   1 -
 docs/how_to/compile_models/sg_execution_times.html |  20 +-
 .../deploy_models/deploy_model_on_android.html     |   2 +-
 .../deploy_object_detection_pytorch.html           |  44 +---
 docs/how_to/deploy_models/deploy_prequantized.html |   8 +-
 .../deploy_models/deploy_prequantized_tflite.html  |   4 +-
 docs/how_to/deploy_models/deploy_quantized.html    |   2 +-
 docs/how_to/deploy_models/deploy_ssd_gluoncv.html  |  39 +--
 docs/how_to/deploy_models/sg_execution_times.html  |  18 +-
 .../extend_tvm/bring_your_own_datatypes.html       |   2 +-
 docs/how_to/extend_tvm/sg_execution_times.html     |  10 +-
 docs/how_to/extend_tvm/use_pass_instrument.html    |  16 +-
 docs/how_to/optimize_operators/opt_conv_cuda.html  |   2 +-
 .../optimize_operators/opt_conv_tensorcore.html    |   2 +-
 docs/how_to/optimize_operators/opt_gemm.html       |  16 +-
 .../optimize_operators/sg_execution_times.html     |   8 +-
 .../sg_execution_times.html                        |  14 +-
 .../tune_conv2d_layer_cuda.html                    | 291 ++++++++-------------
 .../tune_with_autoscheduler/tune_network_cuda.html |   2 +-
 .../tune_with_autoscheduler/tune_network_x86.html  |   4 +-
 .../tune_with_autoscheduler/tune_sparse_x86.html   |  86 ++----
 .../tune_with_autotvm/sg_execution_times.html      |  12 +-
 .../how_to/tune_with_autotvm/tune_conv2d_cuda.html |  34 +--
 docs/how_to/work_with_microtvm/micro_autotune.html |  16 +-
 .../work_with_microtvm/sg_execution_times.html     |  10 +-
 .../how_to/work_with_relay/sg_execution_times.html |   8 +-
 .../work_with_schedules/sg_execution_times.html    |  18 +-
 docs/how_to/work_with_schedules/tensorize.html     |   2 +-
 .../api/doxygen/iter__affine__map_8h.html          |   6 +-
 .../api/doxygen/iter__affine__map_8h_source.html   |   4 +-
 docs/reference/api/doxygen/namespacemembers_d.html |   2 +-
 .../api/doxygen/namespacemembers_func_d.html       |   2 +-
 .../api/doxygen/namespacetvm_1_1arith.html         |  23 +-
 docs/reference/api/doxygen/search/all_5.js         |   2 +-
 docs/reference/api/doxygen/search/functions_4.js   |   2 +-
 docs/reference/api/python/auto_scheduler.html      |   4 +-
 .../api/typedoc/classes/bytestreamreader.html      |  12 +-
 .../api/typedoc/classes/cachedcallstack.html       |  34 +--
 docs/reference/api/typedoc/classes/dldatatype.html |  12 +-
 docs/reference/api/typedoc/classes/dldevice.html   |  10 +-
 .../reference/api/typedoc/classes/environment.html |  12 +-
 docs/reference/api/typedoc/classes/ffilibrary.html |  20 +-
 .../api/typedoc/classes/graphexecutor.html         |  16 +-
 docs/reference/api/typedoc/classes/instance.html   |  40 +--
 docs/reference/api/typedoc/classes/memory.html     |  34 +--
 docs/reference/api/typedoc/classes/module.html     |  10 +-
 docs/reference/api/typedoc/classes/ndarray.html    |  22 +-
 .../api/typedoc/classes/packedfunccell.html        |   6 +-
 docs/reference/api/typedoc/classes/rpcserver.html  |  14 +-
 docs/reference/api/typedoc/classes/scalar.html     |   6 +-
 .../api/typedoc/classes/webgpucontext.html         |  12 +-
 docs/reference/api/typedoc/enums/argtypecode.html  |  30 +--
 .../api/typedoc/enums/aynccallbackcode.html        |   4 +-
 .../api/typedoc/enums/dldatatypecode.html          |   8 +-
 .../api/typedoc/enums/rpcserverstate.html          |  12 +-
 docs/reference/api/typedoc/enums/sizeof.html       |  18 +-
 docs/reference/api/typedoc/index.html              | 112 ++++----
 .../api/typedoc/interfaces/disposable.html         |   2 +-
 .../api/typedoc/interfaces/functioninfo.html       |   6 +-
 .../api/typedoc/interfaces/libraryprovider.html    |   4 +-
 docs/searchindex.js                                |   2 +-
 .../vta/tutorials/autotvm/sg_execution_times.html  |   6 +-
 .../tutorials/frontend/deploy_classification.html  |   2 +-
 .../vta/tutorials/frontend/deploy_detection.html   |   2 +-
 .../vta/tutorials/frontend/sg_execution_times.html |   6 +-
 .../vta/tutorials/optimize/sg_execution_times.html |   6 +-
 docs/topic/vta/tutorials/sg_execution_times.html   |   6 +-
 docs/tutorial/auto_scheduler_matmul_x86.html       |   3 +-
 docs/tutorial/autotvm_relay_x86.html               | 172 ++++++------
 docs/tutorial/cross_compilation_and_rpc.html       |   2 +-
 docs/tutorial/intro_topi.html                      |   2 +-
 docs/tutorial/sg_execution_times.html              |  26 +-
 docs/tutorial/tensor_expr_get_started.html         |  43 ++-
 122 files changed, 1005 insertions(+), 1290 deletions(-)

diff --git a/docs/_sources/how_to/compile_models/from_darknet.rst.txt b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
index 52a45fbe7..d19d70d36 100644
--- a/docs/_sources/how_to/compile_models/from_darknet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
@@ -285,11 +285,6 @@ The process is no different from other examples.
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  0.170 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_darknet.py:
 
 
diff --git a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
index 3fc01fb3b..f8861054a 100644
--- a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
@@ -98,7 +98,7 @@ In this section, we download a pretrained imagenet model and classify an image.
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipa9c2d860-11ff-4171-bbe1-7e0abaa3ad4a from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipeed34e15-264c-4349-9328-89248cd4ff42 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
     x (1, 3, 224, 224)
 
 
diff --git a/docs/_sources/how_to/compile_models/from_paddle.rst.txt b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
index bd5175765..9303c583f 100644
--- a/docs/_sources/how_to/compile_models/from_paddle.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
@@ -201,7 +201,7 @@ Look up prediction top 1 index in 1000 class synset.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  5.922 seconds)
+   **Total running time of the script:** ( 1 minutes  4.311 seconds)
 
 
 .. _sphx_glr_download_how_to_compile_models_from_paddle.py:
diff --git a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
index 83986b90b..fe2c0841e 100644
--- a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
@@ -79,7 +79,7 @@ Load a pretrained PyTorch model
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
-
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     31%|###       | 13.8M/44.7M [00:00<00:00, 145MB/s]
     84%|########3 | 37.5M/44.7M [00:00<00:00, 205MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 204MB/s]
+
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     12%|#2        | 5.53M/44.7M [00:00<00:00, 58.0MB/s]
     25%|##4       | 11.1M/44.7M [00:00<00:00, 55.1MB/s]
     76%|#######6  | 34.0M/44.7M [00:00<00:00, 138MB/s] 
    100%|##########| 44.7M/44.7M [00:00<00:00, 133MB/s]
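
The progress bars above are torchvision fetching the ResNet-18 weights. For context, the conversion step of this tutorial amounts to tracing the model and handing the TorchScript graph to Relay; a minimal sketch, in which the input name "data" and the shapes are illustrative:

    import torch
    import torchvision
    from tvm import relay

    # Load the pretrained ResNet-18 and freeze it into a TorchScript graph.
    model = torchvision.models.resnet18(pretrained=True).eval()
    input_shape = (1, 3, 224, 224)
    scripted_model = torch.jit.trace(model, torch.randn(input_shape)).eval()

    # Hand the traced graph to Relay; each entry is an (input name, shape) pair.
    mod, params = relay.frontend.from_pytorch(scripted_model, [("data", input_shape)])
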
 
 
 
diff --git a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
index 8855b970f..43d9b260b 100644
--- a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
@@ -370,11 +370,6 @@ Run the corresponding model on tensorflow
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  4.616 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_tensorflow.py:
 
 
diff --git a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
index beaa9d5c0..0bab07420 100644
--- a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**04:57.537** total execution time for **how_to_compile_models** files:
+**04:42.004** total execution time for **how_to_compile_models** files:
 
-- **01:05.922**: :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)
-- **01:04.616**: :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``)
-- **01:00.170**: :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)
-- **00:25.787**: :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)
-- **00:22.201**: :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)
-- **00:21.740**: :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)
-- **00:19.544**: :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)
-- **00:14.715**: :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)
-- **00:02.843**: :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)
+- **01:04.311**: :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)
+- **00:58.956**: :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``)
+- **00:56.850**: :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)
+- **00:25.312**: :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)
+- **00:21.831**: :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)
+- **00:20.843**: :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)
+- **00:18.767**: :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)
+- **00:12.671**: :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)
+- **00:02.462**: :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
index fa9ff12bd..b4f1a9c54 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
@@ -393,7 +393,7 @@ Execute on TVM
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      15.8510      15.8504      16.0153      15.6463       0.1026   
+      15.7319      15.6878      16.1813      15.5590       0.1696   
                
 
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
index 83b9e3f8a..cec918028 100644
--- a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
@@ -108,7 +108,7 @@ Load pre-trained maskrcnn from torchvision and do tracing
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
-
      0%|          | 0.00/170M [00:00<?, ?B/s]
      3%|2         | 4.36M/170M [00:00<00:03, 45.7MB/s]
      5%|5         | 9.01M/170M [00:00<00:03, 47.4MB/s]
      9%|8         | 14.9M/170M [00:00<00:03, 53.7MB/s]
     12%|#1        | 20.0M/170M [00:00<00:02, 53.7MB/s]
     15%|#4        | 25.2M/170M [00:00<00:02, 53.9MB/s]
     18%|#7        | 30.3M/170M [00:00<00:02, 51.9MB/s]
     21%|##        | 35.3M/170M [00:00<00:02, 51.1MB/s]
     24%|##3       | 40.2M/170M [00:00<00:02, 50.2MB/s]
     27%|##6       | 45.1M/170M [00:00<00:02, 50.4MB/s]
     29%|##9       | 49.9M/170M [00:01<00:02, 50.0MB/s]
     32%|###2      | 54.7M/170M [00:01<00:02, 47.5MB/s]
     35%|###4      | 59.2M/170M [00:01<00:02, 47.4MB/s]
     38%|###8      | 64.8M/170M [00:01<00:02, 50.6MB/s]
     41%|####1     | 70.1M/170M [00:01<00:02, 52.1MB/s]
     44%|####4     | 75.1M/170M [00:01<00:02, 45.3MB/s]
     48%|####7     | 80.9M/170M [00:01<00:01, 49.5MB/s]
     51%|#####     | 85.8M/170M [00:01<00:01, 49.5MB/s]
     54%|#####4    | 91.9M/170M [00:01<00:01, 53.4MB/s]
     57%|#####7    | 97.1M/170M [00:02<00:01, 52.4MB/s]
     60%|######    | 102M/170M [00:02<00:01, 53.6MB/s] 
     63%|######3   | 108M/170M [00:02<00:01, 48.3MB/s]
     66%|######6   | 112M/170M [00:02<00:01, 47.1MB/s]
     69%|######8   | 117M/170M [00:02<00:01, 47.6MB/s]
     72%|#######1  | 122M/170M [00:02<00:01, 48.8MB/s]
     76%|#######5  | 128M/170M [00:02<00:00, 53.6MB/s]
     79%|#######8  | 134M/170M [00:02<00:00, 52.3MB/s]
     82%|########1 | 139M/170M [00:02<00:00, 52.5MB/s]
     85%|########4 | 144M/170M [00:02<00:00, 52.6MB/s]
     88%|########7 | 149M/170M [00:03<00:00, 54.5MB/s]
     91%|#########1| 155M/170M [00:03<00:00, 51.6MB/s]
     94%|#########3| 160M/170M [00:03<00:00, 51.6MB/s]
     97%|#########6| 165M/170M [00:03<00:00, 47.0MB/s]
    100%|#########9| 169M/170M [00:03<00:00, 44.8MB/s]
    100%|##########| 170M/170M [00:03<00:00, 49.9MB/s]
+
      0%|          | 0.00/170M [00:00<?, ?B/s]
     10%|9         | 16.3M/170M [00:00<00:00, 171MB/s]
     24%|##3       | 40.4M/170M [00:00<00:00, 219MB/s]
     38%|###7      | 64.4M/170M [00:00<00:00, 234MB/s]
     52%|#####2    | 88.4M/170M [00:00<00:00, 241MB/s]
     66%|######6   | 113M/170M [00:00<00:00, 245MB/s] 
     80%|########  | 136M/170M [00:00<00:00, 246MB/s]
     94%|#########4| 160M/170M [00:00<00:00, 247MB/s]
    100%|##########| 170M/170M [00:00<00:00, 240MB/s]
     /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
       for i in range(dim)
     /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -253,7 +253,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 3 minutes  8.653 seconds)
+   **Total running time of the script:** ( 2 minutes  59.000 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py:
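
The tracing step this tutorial times is slightly unusual because torchvision's Mask R-CNN returns a dict of tensors; a minimal sketch of the wrapper idea, where the wrapper class and the 300x300 input size are illustrative:

    import torch
    import torchvision

    # Mask R-CNN returns a dict, so a wrapper exposes plain tensors for tracing.
    class TraceWrapper(torch.nn.Module):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, inp):
            out = self.model(inp)[0]
            return out["boxes"], out["scores"], out["labels"], out["masks"]

    model = TraceWrapper(
        torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    )
    model.eval()
    with torch.no_grad():
        script_module = torch.jit.trace(model, torch.rand(1, 3, 300, 300))
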
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
index 6bfcc4fed..7a7078745 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
@@ -187,7 +187,7 @@ training. Other models require a full post training calibration.
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
-
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 145MB/s]
+
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
     28%|##8       | 3.83M/13.6M [00:00<00:00, 40.1MB/s]
     59%|#####9    | 8.06M/13.6M [00:00<00:00, 42.6MB/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 63.0MB/s]
 
 
 
@@ -344,7 +344,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      90.2464      90.1483      91.1529      90.0241       0.2455   
+      90.0879      90.0210      90.8503      89.8837       0.1930   
                
 
 
@@ -384,7 +384,7 @@ TODO
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  6.301 seconds)
+   **Total running time of the script:** ( 1 minutes  3.828 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized.py:
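
The execution time summaries above come from TVM's time_evaluator, which runs the compiled function repeatedly and reports statistics; a minimal sketch, assuming a library `lib` already built with relay.build for an llvm target and an input tensor named "input":

    import numpy as np
    import tvm
    from tvm.contrib import graph_executor

    # Wrap the built library in a graph executor on the CPU.
    dev = tvm.cpu(0)
    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))

    # time_evaluator repeats the run and reports the statistics summarized above.
    ftimer = module.module.time_evaluator("run", dev, number=10, repeat=5)
    res_ms = np.array(ftimer().results) * 1e3  # seconds -> milliseconds
    print("mean %.4f ms, std %.4f ms" % (res_ms.mean(), res_ms.std()))
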
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
index 7a3efb08f..1e413a9c7 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
@@ -351,7 +351,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      119.9655     119.9535     121.8199     119.1104      0.3561   
+      120.2053     120.0332     129.6409     119.2245      1.0663   
                
 
 
@@ -385,7 +385,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  52.691 seconds)
+   **Total running time of the script:** ( 1 minutes  51.148 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
index 103f11862..bb6d55d24 100644
--- a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
@@ -221,7 +221,7 @@ We create a Relay VM to build and execute the model.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  15.594 seconds)
+   **Total running time of the script:** ( 1 minutes  11.439 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_quantized.py:
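
For context, compiling and running through the Relay VM, as this tutorial does, looks roughly like the sketch below, assuming `mod`, `params`, and `input_data` are already prepared:

    import tvm
    from tvm import relay

    # Compile the (quantized) Relay module into a VM executable.
    with tvm.transform.PassContext(opt_level=3):
        vm_exec = relay.vm.compile(mod, target="llvm", params=params)

    # Instantiate the VM and invoke the model's main function.
    dev = tvm.cpu(0)
    vm = tvm.runtime.vm.VirtualMachine(vm_exec, dev)
    result = vm.invoke("main", input_data)  # input_data: numpy or tvm.nd array
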
diff --git a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
index 2c164bf67..46a2d077a 100644
--- a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
@@ -137,7 +137,7 @@ Convert and compile model for CPU.
             data: None
       input_sym_arg_type = in_param.infer_type()[0]
     Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
-
      0%|          | 0/132723 [00:00<?, ?KB/s]
      2%|2         | 2805/132723 [00:00<00:04, 28046.95KB/s]
      6%|6         | 8209/132723 [00:00<00:02, 43333.28KB/s]
     12%|#2        | 16549/132723 [00:00<00:01, 61625.06KB/s]
     19%|#8        | 25068/132723 [00:00<00:01, 70924.89KB/s]
     25%|##5       | 33569/132723 [00:00<00:01, 76000.78KB/s]
     32%|###1      | 42087/132723 [00:00<00:01, 79119.22KB/s]
     38%|###8      | 50584/132723 [00:00<00:01, 81029.42KB/s]
     45%|####4     | 59145/132723 [00:00<00:00, 82476.70KB/s]
     51%|#####     | 67594/132723 [00:00<00:00, 83102.80KB/s]
     57%|#####7    | 76168/132723 [00:01<00:00, 83915.47KB/s]
     64%|######3   | 84560/132723 [00:01<00:00, 71882.90KB/s]
     69%|######9   | 92044/132723 [00:01<00:00, 71191.74KB/s]
     75%|#######4  | 99365/132723 [00:01<00:00, 71733.12KB/s]
     80%|########  | 106684/132723 [00:01<00:00, 58403.00KB/s]
     86%|########5 | 114001/132723 [00:01<00:00, 62039.63KB/s]
     91%|######### | 120618/132723 [00:01<00:00, 48370.80KB/s]
     97%|#########7| 129142/132723 [00:01<00:00, 56547.44KB/s]
    100%|##########| 132723/132723 [00:02<00:00, 66173.10KB/s]
+
      0%|          | 0/132723 [00:00<?, ?KB/s]
      3%|3         | 4144/132723 [00:00<00:03, 41437.27KB/s]
      9%|8         | 11290/132723 [00:00<00:02, 59093.28KB/s]
     14%|#4        | 18751/132723 [00:00<00:01, 66174.53KB/s]
     20%|##        | 26640/132723 [00:00<00:01, 71191.75KB/s]
     25%|##5       | 33760/132723 [00:00<00:01, 69404.79KB/s]
     31%|###1      | 41725/132723 [00:00<00:01, 72826.56KB/s]
     38%|###7      | 49800/132723 [00:00<00:01, 75383.68KB/s]
     43%|####3     | 57680/132723 [00:00<00:00, 76457.72KB/s]
     49%|####9     | 65334/132723 [00:00<00:00, 76466.72KB/s]
     55%|#####4    | 72987/132723 [00:01<00:00, 76404.53KB/s]
     61%|######    | 80632/132723 [00:01<00:00, 76351.90KB/s]
     67%|######6   | 88270/132723 [00:01<00:00, 66500.77KB/s]
     72%|#######1  | 95145/132723 [00:01<00:00, 65937.94KB/s]
     77%|#######6  | 101893/132723 [00:01<00:00, 53808.17KB/s]
     83%|########2 | 109644/132723 [00:01<00:00, 59575.68KB/s]
     88%|########7 | 116327/132723 [00:01<00:00, 61443.14KB/s]
     94%|#########3| 124123/132723 [00:01<00:00, 65889.42KB/s]
     99%|#########9| 131929/132723 [00:01<00:00, 69268.25KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 67766.51KB/s]
 
 
 
@@ -202,7 +202,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  25.317 seconds)
+   **Total running time of the script:** ( 2 minutes  20.659 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
index 898372f0d..b39158084 100644
--- a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
@@ -5,13 +5,13 @@
 
 Computation times
 =================
-**10:39.563** total execution time for **how_to_deploy_models** files:
+**10:14.652** total execution time for **how_to_deploy_models** files:
 
-- **03:08.653**: :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``)
-- **02:25.317**: :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)
-- **01:52.691**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)
-- **01:15.594**: :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)
-- **01:06.301**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)
-- **00:29.042**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)
-- **00:21.770**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)
-- **00:00.193**: :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)
+- **02:59.000**: :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``)
+- **02:20.659**: :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)
+- **01:51.148**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)
+- **01:11.439**: :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)
+- **01:03.828**: :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)
+- **00:27.257**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)
+- **00:21.141**: :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)
+- **00:00.181**: :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)
diff --git a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
index 18e52d949..87e9d3d64 100644
--- a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
@@ -423,7 +423,7 @@ First let us define two helper functions to get the mobilenet model and a cat im
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zipbb3576fa-0825-4ae2-a4d0-31959dcffe09 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip7b13c604-a211-40c8-8a74-68a208b9f928 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 
 
 
diff --git a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
index 24d7152f3..c25d8a7bf 100644
--- a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
@@ -5,9 +5,9 @@
 
 Computation times
 =================
-**00:38.436** total execution time for **how_to_extend_tvm** files:
+**00:37.937** total execution time for **how_to_extend_tvm** files:
 
-- **00:34.919**: :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``)
-- **00:02.273**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)
-- **00:01.051**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)
-- **00:00.194**: :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)
+- **00:34.488**: :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``)
+- **00:02.219**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)
+- **00:01.044**: :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)
+- **00:00.186**: :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
index 442c634cd..4d8443967 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
@@ -199,10 +199,10 @@ profile the execution time of each pass.
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 6061us [6061us] (45.57%; 45.57%)
-    FoldScaleAxis: 7240us [2us] (54.43%; 54.43%)
-            FoldConstant: 7238us [1483us] (54.42%; 99.97%)
-                    InferType: 5756us [5756us] (43.27%; 79.52%)
+    InferType: 6264us [6264us] (46.04%; 46.04%)
+    FoldScaleAxis: 7341us [3us] (53.96%; 53.96%)
+            FoldConstant: 7338us [1498us] (53.94%; 99.96%)
+                    InferType: 5840us [5840us] (42.93%; 79.59%)
 
 
 
@@ -239,10 +239,10 @@ Refer to the following sections and :py:func:`tvm.instrument.pass_instrument` for the details.
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 5825us [5825us] (44.84%; 44.84%)
-    FoldScaleAxis: 7165us [2us] (55.16%; 55.16%)
-            FoldConstant: 7163us [1493us] (55.14%; 99.97%)
-                    InferType: 5671us [5671us] (43.65%; 79.16%)
+    InferType: 5868us [5868us] (44.66%; 44.66%)
+    FoldScaleAxis: 7271us [2us] (55.34%; 55.34%)
+            FoldConstant: 7269us [1511us] (55.32%; 99.97%)
+                    InferType: 5758us [5758us] (43.83%; 79.22%)
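
Profiles like the two above are collected with TVM's built-in PassTimingInstrument; a minimal sketch, assuming a Relay module `mod`:

    import tvm
    from tvm import relay

    timing_inst = tvm.ir.instrument.PassTimingInstrument()
    with tvm.transform.PassContext(opt_level=3, instruments=[timing_inst]):
        mod = relay.transform.InferType()(mod)
        mod = relay.transform.FoldScaleAxis()(mod)
        # render() must be called while the PassContext is still active.
        profile = timing_inst.render()
    print(profile)
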
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
index b16a17629..69e1ff51f 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
@@ -295,7 +295,7 @@ latency of convolution.
 
  .. code-block:: none
 
-    Convolution: 51.352703 ms
+    Convolution: 54.117038 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
index aa3f9d05b..b06787e59 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
@@ -626,7 +626,7 @@ be able to run on our build server
 
  .. code-block:: none
 
-    conv2d with tensor core: 6.634472 ms
+    conv2d with tensor core: 6.532508 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
index 845a80ea3..7e0f76b9e 100644
--- a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
@@ -118,8 +118,8 @@ Then we write a baseline implementation, the simplest way to write a matrix multiplication.
 
  .. code-block:: none
 
-    Numpy running time: 0.019783
-    Baseline: 3.546348
+    Numpy running time: 0.018150
+    Baseline: 3.239384
 
 
 
@@ -209,7 +209,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 
  .. code-block:: none
 
-    Opt1: 0.322029
+    Opt1: 0.297664
 
 
 
@@ -307,7 +307,7 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cache friendly.
 
  .. code-block:: none
 
-    Opt2: 0.348650
+    Opt2: 0.338107
 
 
 
@@ -398,7 +398,7 @@ the access pattern for A matrix is more cache friendly.
 
  .. code-block:: none
 
-    Opt3: 0.118034
+    Opt3: 0.115493
 
 
 
@@ -516,7 +516,7 @@ flattening.
 
  .. code-block:: none
 
-    Opt4: 0.111703
+    Opt4: 0.110447
 
 
 
@@ -633,7 +633,7 @@ write to C when all the block results are ready.
 
  .. code-block:: none
 
-    Opt5: 0.111723
+    Opt5: 0.111191
 
 
 
@@ -753,7 +753,7 @@ Furthermore, we can also utilize multi-core processors to do the thread-level parallelization.
 
  .. code-block:: none
 
-    Opt6: 0.144970
+    Opt6: 0.143822
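
The jump from Baseline to Opt1 above comes from cache blocking: a 32x32 tile of floats occupies 32 * 32 * sizeof(float) = 4 KB, comfortably inside a 32 KB L1 data cache. A minimal sketch of that tiling in tensor-expression form, with illustrative sizes and split factors:

    import tvm
    from tvm import te

    # Declare a plain matmul over 1024x1024 float32 matrices.
    M = N = K = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")

    # Tile the output into 32x32 blocks and split the reduction axis.
    s = te.create_schedule(C.op)
    bn = 32  # 32 * 32 floats = 4 KB per output tile
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=4)
    s[C].reorder(xo, yo, ko, ki, xi, yi)
    func = tvm.build(s, [A, B, C], target="llvm", name="mmult")
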
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
index 3094a4ad8..f36e6408b 100644
--- a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
@@ -5,8 +5,8 @@
 
 Computation times
 =================
-**00:35.915** total execution time for **how_to_optimize_operators** files:
+**00:34.280** total execution time for **how_to_optimize_operators** files:
 
-- **00:33.326**: :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)
-- **00:01.370**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``)
-- **00:01.219**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)
+- **00:31.695**: :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)
+- **00:01.371**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``)
+- **00:01.214**: :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
index 213b6f41a..b181138a2 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
@@ -5,11 +5,11 @@
 
 Computation times
 =================
-**04:55.287** total execution time for **how_to_tune_with_autoscheduler** files:
-
-- **02:19.285**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
-- **01:20.977**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)
-- **00:40.523**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)
-- **00:16.676**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)
-- **00:08.992**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)
-- **00:08.835**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)
+**04:53.728** total execution time for **how_to_tune_with_autoscheduler** files:
+
+- **02:20.981**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
+- **01:19.068**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)
+- **00:40.131**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)
+- **00:16.628**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)
+- **00:08.632**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)
+- **00:08.287**: :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)
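
The tune_conv2d_layer_cuda entry above drives the auto-scheduler, whose generated code (shown in the next diff) uses shared-memory staging, cooperative fetching, and unrolling. A minimal sketch of launching such a search, where the trial count and log file name are illustrative:

    import tvm
    from tvm import auto_scheduler, te, topi

    @auto_scheduler.register_workload
    def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
        data = te.placeholder((N, CI, H, W), name="data")
        kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
        conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1)
        return [data, kernel, conv]

    # Define the search task and tune; real runs use many more trials.
    target = tvm.target.Target("cuda")
    task = auto_scheduler.SearchTask(
        func=conv2d_layer, args=(1, 7, 7, 512, 512, 3, 3, 1, 1), target=target
    )
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=10,
        measure_callbacks=[auto_scheduler.RecordToFile("conv2d.json")],
    )
    task.tune(tune_option)
    sch, args = task.apply_best("conv2d.json")
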
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
index 4ad0d6a7d..a3a683d12 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
@@ -222,110 +222,70 @@ cooperative fetching, unrolling and operator fusion.
                  compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
       buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
       attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 16;
-      allocate(conv2d_nchw: Pointer(local float32), float32, [14]), storage_scope = local;
-      allocate(pad_temp.shared: Pointer(shared float32), float32, [504]), storage_scope = shared;
-      allocate(kernel.shared: Pointer(shared float32), float32, [768]), storage_scope = shared;
-      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {
-        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [14], [], scope="local", align=32)[0] = 0f32
+      allocate(conv2d_nchw: Pointer(local float32), float32, [7]), storage_scope = local;
+      allocate(pad_temp.shared: Pointer(shared float32), float32, [1008]), storage_scope = shared;
+      allocate(kernel.shared: Pointer(shared float32), float32, [1536]), storage_scope = shared;
+      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224 {
+        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [7], [], scope="local", align=16)[0] = 0f32
         conv2d_nchw_1[1] = 0f32
         conv2d_nchw_1[2] = 0f32
         conv2d_nchw_1[3] = 0f32
         conv2d_nchw_1[4] = 0f32
         conv2d_nchw_1[5] = 0f32
         conv2d_nchw_1[6] = 0f32
-        conv2d_nchw_1[7] = 0f32
-        conv2d_nchw_1[8] = 0f32
-        conv2d_nchw_1[9] = 0f32
-        conv2d_nchw_1[10] = 0f32
-        conv2d_nchw_1[11] = 0f32
-        conv2d_nchw_1[12] = 0f32
-        conv2d_nchw_1[13] = 0f32
-        for (rc.outer.outer: int32, 0, 64) {
-          for (ry.outer.outer: int32, 0, 3) {
-            let cse_var_4: int32 = (rc.outer.outer*392)
-            let cse_var_3: int32 = (ry.outer.outer*7)
-            let cse_var_2: int32 = (rc.outer.outer*72)
-            let cse_var_1: int32 = (ry.outer.outer*3)
+        for (rc.outer.outer: int32, 0, 32) {
+          for (rx.outer.outer: int32, 0, 3) {
+            let cse_var_2: int32 = (rc.outer.outer*144)
+            let cse_var_1: int32 = (rc.outer.outer*784)
              {
-              attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              pad_temp.shared_1: Buffer(pad_temp.shared, float32, [504], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else(((((1 <= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) && ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              pad_temp.shared_1[(threadIdx.x_1 + 112)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 112), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 112), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 112), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              pad_temp.shared_1[(threadIdx.x_1 + 224)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 224), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 224), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 224), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              pad_temp.shared_1[(threadIdx.x_1 + 336)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 336), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 336), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 336), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
-                pad_temp.shared_1[(threadIdx.x_1 + 448)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 448), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 448), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 448), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224 {
+                pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1008], [], scope="shared")[(threadIdx.x_1*2)] = @tir.if_then_else(((((7 <= floormod((threadIdx.x_1*2), 63)) && (floormod((threadIdx.x_1*2), 63) < 56)) && (1 <= (rx.outer.outer + floormod((threadIdx.x_1*2), 7)))) && ((rx.outer.outer + floormod((threadIdx.x_1*2), 7)) < 8)), data[((((cse_var_1 + (floordiv((threadIdx.x_1*2), 63)*49)) + rx.outer.outer) + floormod((threadIdx.x_1*2), 63)) - 8)], 0f32, dtype=float32)
+                pad_temp.shared_1[((threadIdx.x_1*2) + 1)] = @tir.if_then_else(((((7 <= floormod(((threadIdx.x_1*2) + 1), 63)) && (floormod(((threadIdx.x_1*2) + 1), 63) < 56)) && (1 <= (rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)))) && ((rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)) < 8)), data[((((cse_var_1 + (floordiv(((threadIdx.x_1*2) + 1), 63)*49)) + rx.outer.outer) + floormod(((threadIdx.x_1*2) + 1), 63)) - 8)], 0f32, dtype=float32)
               }
-              attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              kernel.shared_1: Buffer(kernel.shared, float32, [768], [], scope="shared")[threadIdx.x_2] = kernel[((((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 24)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
-              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              kernel.shared_1[(threadIdx.x_2 + 112)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 14), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 112), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
-              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              kernel.shared_1[(threadIdx.x_2 + 224)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 28), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 224), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
-              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              kernel.shared_1[(threadIdx.x_2 + 336)] = kernel[(((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 64512)]
-              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 56), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 448), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
-              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              kernel.shared_1[(threadIdx.x_2 + 560)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 70), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 560), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
-              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
-              if @tir.likely((threadIdx.x_2 < 96), dtype=bool) {
-                kernel.shared_1[(threadIdx.x_2 + 672)] = kernel[(((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 129024)]
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224 {
+                pad_temp.shared_1[(((floordiv((floordiv((threadIdx.x_1*2), 7) + 64), 9)*63) + (floormod((floordiv((threadIdx.x_1*2), 7) + 1), 9)*7)) + floormod((threadIdx.x_1*2), 7))] = @tir.if_then_else(((((1 <= floormod((floordiv((threadIdx.x_1*2), 7) + 1), 9)) && (floormod((floordiv((threadIdx.x_1*2), 7) + 1), 9) < 8)) && (1 <= (rx.outer.outer + floormod((threadIdx.x_1*2), 7)))) && ((rx.outer.outer + floormod((threadIdx.x_1*2), 7)) < 8)), data[(((((cse_var_1 + (floordiv((floordiv((thr [...]
+                pad_temp.shared_1[(((floordiv((floordiv(((threadIdx.x_1*2) + 1), 7) + 64), 9)*63) + (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 1), 9)*7)) + floormod(((threadIdx.x_1*2) + 1), 7))] = @tir.if_then_else(((((1 <= floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 1), 9)) && (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 1), 9) < 8)) && (1 <= (rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)))) && ((rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)) < 8)), data [...]
               }
-              for (rc.outer.inner: int32, 0, 8) {
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9))]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 8)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-                conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-                conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-                conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-                conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-                conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-                conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 8)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224 {
+                if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+                  pad_temp.shared_1[(((floordiv((floordiv((threadIdx.x_1*2), 7) + 128), 9)*63) + (floormod((floordiv((threadIdx.x_1*2), 7) + 2), 9)*7)) + floormod((threadIdx.x_1*2), 7))] = @tir.if_then_else(((((1 <= floormod((floordiv((threadIdx.x_1*2), 7) + 2), 9)) && (floormod((floordiv((threadIdx.x_1*2), 7) + 2), 9) < 8)) && (1 <= (rx.outer.outer + floormod((threadIdx.x_1*2), 7)))) && ((rx.outer.outer + floormod((threadIdx.x_1*2), 7)) < 8)), data[(((((cse_var_1 + (floordiv((floordiv(( [...]
+                }
+                if @tir.likely((threadIdx.x_1 < 56), dtype=bool) {
+                  pad_temp.shared_1[(((floordiv((floordiv(((threadIdx.x_1*2) + 1), 7) + 128), 9)*63) + (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 2), 9)*7)) + floormod(((threadIdx.x_1*2) + 1), 7))] = @tir.if_then_else(((((1 <= floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 2), 9)) && (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 2), 9) < 8)) && (1 <= (rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)))) && ((rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)) < 8)), d [...]
+                }
+              }
+              attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              kernel.shared_1: Buffer(kernel.shared, float32, [1536], [], scope="shared")[threadIdx.x_2] = kernel[(((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 48)*4608)) + cse_var_2) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              kernel.shared_1[(threadIdx.x_2 + 224)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 14), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 32), 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 28), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 16), 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              kernel.shared_1[(threadIdx.x_2 + 672)] = kernel[((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 16), 3)*4608)) + cse_var_2) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer) + 64512)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              kernel.shared_1[(threadIdx.x_2 + 896)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 56), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 32), 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              kernel.shared_1[(threadIdx.x_2 + 1120)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 70), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 16), 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 224;
+              if @tir.likely((threadIdx.x_2 < 192), dtype=bool) {
+                kernel.shared_1[(threadIdx.x_2 + 1344)] = kernel[((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 16), 3)*4608)) + cse_var_2) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer) + 129024)]
+              }
+              for (rc.outer.inner: int32, 0, 16) {
+                for (ry.outer.inner: int32, 0, 3) {
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 14)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 21)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 35)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 42)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+                }
               }
             }
           }
         }
-        for (i1.inner: int32, 0, 2) {
-          for (i3.inner: int32, 0, 7) {
-            compute[(((((blockIdx.x*1568) + (floordiv(threadIdx.x, 7)*98)) + (i1.inner*49)) + (floormod(threadIdx.x, 7)*7)) + i3.inner)] = max((conv2d_nchw_1[((i1.inner*7) + i3.inner)] + bias[(((blockIdx.x*32) + (floordiv(threadIdx.x, 7)*2)) + i1.inner)]), 0f32)
-          }
+        for (i2.inner: int32, 0, 7) {
+          compute[((((blockIdx.x*1568) + (floordiv(threadIdx.x, 7)*49)) + (i2.inner*7)) + floormod(threadIdx.x, 7))] = max((conv2d_nchw_1[i2.inner] + bias[((blockIdx.x*32) + floordiv(threadIdx.x, 7))]), 0f32)
         }
       }
     }
@@ -378,7 +338,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.276 ms
+    Execution time of this operator: 0.317 ms
 
 
 
@@ -423,35 +383,35 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
     conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
     conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
-    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
-    conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=16)
+    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
+    conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=32)
     conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
-    conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
+    conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=7)
     conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
-    conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
+    conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
     conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
     conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
-    conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=7)
-    conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=1)
+    conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
+    conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
     conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
     conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=1)
-    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=8)
+    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=16)
     conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
-    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
     conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
-    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
     s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2 [...]
     compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
     compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
     compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
-    compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=16)
+    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
+    compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=32)
     compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
-    compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
-    compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
+    compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=7)
+    compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
     compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
-    compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
-    compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
+    compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
+    compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
     compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
     s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
     s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
@@ -471,14 +431,14 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
     kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
     s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
+    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=224)
     s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
-    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
+    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=2)
     s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
+    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=224)
     s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 64)
+    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 16)
     s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
 
     CUDA source code:
@@ -496,10 +456,10 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
       #define int64_t long long
       #define uint64_t unsigned long long
     #endif
-    extern "C" __global__ void __launch_bounds__(112) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-      float conv2d_nchw[14];
-      __shared__ float pad_temp_shared[504];
-      __shared__ float kernel_shared[768];
+    extern "C" __global__ void __launch_bounds__(224) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+      float conv2d_nchw[7];
+      __shared__ float pad_temp_shared[1008];
+      __shared__ float kernel_shared[1536];
       conv2d_nchw[0] = 0.000000e+00f;
       conv2d_nchw[1] = 0.000000e+00f;
       conv2d_nchw[2] = 0.000000e+00f;
@@ -507,83 +467,44 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
       conv2d_nchw[4] = 0.000000e+00f;
       conv2d_nchw[5] = 0.000000e+00f;
       conv2d_nchw[6] = 0.000000e+00f;
-      conv2d_nchw[7] = 0.000000e+00f;
-      conv2d_nchw[8] = 0.000000e+00f;
-      conv2d_nchw[9] = 0.000000e+00f;
-      conv2d_nchw[10] = 0.000000e+00f;
-      conv2d_nchw[11] = 0.000000e+00f;
-      conv2d_nchw[12] = 0.000000e+00f;
-      conv2d_nchw[13] = 0.000000e+00f;
-      for (int rc_outer_outer = 0; rc_outer_outer < 64; ++rc_outer_outer) {
-        for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
+      for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
+        for (int rx_outer_outer = 0; rx_outer_outer < 3; ++rx_outer_outer) {
           __syncthreads();
-          pad_temp_shared[((int)threadIdx.x)] = (((((1 <= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) && ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 392) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-          pad_temp_shared[(((int)threadIdx.x) + 112)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 112) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-          pad_temp_shared[(((int)threadIdx.x) + 224)] = (((((1 <= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 224) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-          pad_temp_shared[(((int)threadIdx.x) + 336)] = (((((1 <= ((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 336) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) * 2)] = (((((7 <= ((((int)threadIdx.x) * 2) % 63)) && (((((int)threadIdx.x) * 2) % 63) < 56)) && (1 <= (rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)))) && ((rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) * 2) / 63) * 49)) + rx_outer_outer) + ((((int)threadIdx.x) * 2) % 63)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[((((int)threadIdx.x) * 2) + 1)] = (((((7 <= (((((int)threadIdx.x) * 2) + 1) % 63)) && ((((((int)threadIdx.x) * 2) + 1) % 63) < 56)) && (1 <= (rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)))) && ((rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) + 1) / 63) * 49)) + rx_outer_outer) + (((((int)threadIdx.x) * 2) + 1) % 63)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[((((((((int)threadIdx.x) * 2) + 448) / 63) * 63) + (((((((int)threadIdx.x) * 2) / 7) + 1) % 9) * 7)) + ((((int)threadIdx.x) * 2) % 7))] = (((((1 <= ((((((int)threadIdx.x) * 2) / 7) + 1) % 9)) && (((((((int)threadIdx.x) * 2) / 7) + 1) % 9) < 8)) && (1 <= (rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)))) && ((rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) + 448) / 63) * 49)) + (((((((int)th [...]
+          pad_temp_shared[((((((((int)threadIdx.x) * 2) + 449) / 63) * 63) + ((((((((int)threadIdx.x) * 2) + 1) / 7) + 1) % 9) * 7)) + (((((int)threadIdx.x) * 2) + 1) % 7))] = (((((1 <= (((((((int)threadIdx.x) * 2) + 1) / 7) + 1) % 9)) && ((((((((int)threadIdx.x) * 2) + 1) / 7) + 1) % 9) < 8)) && (1 <= (rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)))) && ((rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) [...]
           if (((int)threadIdx.x) < 56) {
-            pad_temp_shared[(((int)threadIdx.x) + 448)] = (((((1 <= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 448) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+            pad_temp_shared[((((((((int)threadIdx.x) * 2) + 896) / 63) * 63) + (((((((int)threadIdx.x) * 2) / 7) + 2) % 9) * 7)) + ((((int)threadIdx.x) * 2) % 7))] = (((((1 <= ((((((int)threadIdx.x) * 2) / 7) + 2) % 9)) && (((((((int)threadIdx.x) * 2) / 7) + 2) % 9) < 8)) && (1 <= (rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)))) && ((rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) + 896) / 63) * 49)) + (((((((int) [...]
           }
-          kernel_shared[((int)threadIdx.x)] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
-          kernel_shared[(((int)threadIdx.x) + 112)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 112) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-          kernel_shared[(((int)threadIdx.x) + 224)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 224) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-          kernel_shared[(((int)threadIdx.x) + 336)] = kernel[(((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 64512)];
-          kernel_shared[(((int)threadIdx.x) + 448)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-          kernel_shared[(((int)threadIdx.x) + 560)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-          if (((int)threadIdx.x) < 96) {
-            kernel_shared[(((int)threadIdx.x) + 672)] = kernel[(((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 129024)];
+          if (((int)threadIdx.x) < 56) {
+            pad_temp_shared[((((((((int)threadIdx.x) * 2) + 897) / 63) * 63) + ((((((((int)threadIdx.x) * 2) + 1) / 7) + 2) % 9) * 7)) + (((((int)threadIdx.x) * 2) + 1) % 7))] = (((((1 <= (((((((int)threadIdx.x) * 2) + 1) / 7) + 2) % 9)) && ((((((((int)threadIdx.x) * 2) + 1) / 7) + 2) % 9) < 8)) && (1 <= (rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)))) && ((rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + ((((((int)threadIdx.x) *  [...]
+          }
+          kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 224)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 224) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 32) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 448)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 448) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 672)] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer) + 64512)];
+          kernel_shared[(((int)threadIdx.x) + 896)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 896) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 32) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 1120)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 1120) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) % 48) * 3)) + rx_outer_outer)];
+          if (((int)threadIdx.x) < 192) {
+            kernel_shared[(((int)threadIdx.x) + 1344)] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer) + 129024)];
           }
           __syncthreads();
-          for (int rc_outer_inner = 0; rc_outer_inner < 8; ++rc_outer_inner) {
-            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 8)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-            conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-            conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-            conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-            conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-            conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-            conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 8)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
+          for (int rc_outer_inner = 0; rc_outer_inner < 16; ++rc_outer_inner) {
+            for (int ry_outer_inner = 0; ry_outer_inner < 3; ++ry_outer_inner) {
+              conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+              conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+              conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 14)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+              conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 21)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+              conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+              conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 35)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+              conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 42)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+            }
           }
         }
       }
-      for (int i1_inner = 0; i1_inner < 2; ++i1_inner) {
-        for (int i3_inner = 0; i3_inner < 7; ++i3_inner) {
-          compute[(((((((int)blockIdx.x) * 1568) + ((((int)threadIdx.x) / 7) * 98)) + (i1_inner * 49)) + ((((int)threadIdx.x) % 7) * 7)) + i3_inner)] = max((conv2d_nchw[((i1_inner * 7) + i3_inner)] + bias[(((((int)blockIdx.x) * 32) + ((((int)threadIdx.x) / 7) * 2)) + i1_inner)]), 0.000000e+00f);
-        }
+      for (int i2_inner = 0; i2_inner < 7; ++i2_inner) {
+        compute[((((((int)blockIdx.x) * 1568) + ((((int)threadIdx.x) / 7) * 49)) + (i2_inner * 7)) + (((int)threadIdx.x) % 7))] = max((conv2d_nchw[i2_inner] + bias[((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 7))]), 0.000000e+00f);
       }
     }
 
@@ -642,7 +563,7 @@ In the example below we resume the status and do 5 more trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  19.285 seconds)
+   **Total running time of the script:** ( 2 minutes  20.981 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:
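For readers who want to re-check the "Execution time of this operator" figure above, here is a minimal sketch that re-applies the best record from a finished tuning log and times the kernel. It mirrors the conv2d workload this tutorial tunes (N=1, 512-channel 7x7 input, 512 3x3 filters, stride 1, pad 1); the log file name ``conv2d.json`` is an assumption for illustration, not part of this commit.

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import auto_scheduler, te, topi

    # Same workload definition as the tutorial being documented.
    @auto_scheduler.register_workload
    def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
        data = te.placeholder((N, CI, H, W), name="data")
        kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
        bias = te.placeholder((1, CO, 1, 1), name="bias")
        conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
        out = topi.nn.relu(conv + bias)
        return [data, kernel, bias, out]

    target = tvm.target.Target("cuda")
    task = auto_scheduler.SearchTask(
        func=conv2d_layer, args=(1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)), target=target
    )

    # Re-apply the best schedule found during tuning and rebuild the kernel.
    sch, args = task.apply_best("conv2d.json")  # assumed log from a prior run
    func = tvm.build(sch, args, target)

    # Time it the same way the tutorial does.
    dev = tvm.cuda()
    data_np = np.random.uniform(size=(1, 512, 7, 7)).astype(np.float32)
    weight_np = np.random.uniform(size=(512, 512, 3, 3)).astype(np.float32)
    bias_np = np.random.uniform(size=(1, 512, 1, 1)).astype(np.float32)
    out_np = np.empty((1, 512, 7, 7), dtype=np.float32)
    tensors = [tvm.nd.array(x, device=dev) for x in (data_np, weight_np, bias_np, out_np)]
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    print("Execution time of this operator: %.3f ms"
          % (np.median(np.array(evaluator(*tensors).results)) * 1000))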
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
index 472103393..45ae15b01 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
@@ -614,7 +614,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-       9.8262       9.8166       9.8458       9.8162       0.0139   
+       9.6649       9.6723       9.6914       9.6310       0.0252   
                
 
 
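The summary table above is produced by TVM's time evaluator running the compiled graph repeatedly. As a self-contained sketch of how such a table comes about, the snippet below builds a toy one-convolution network (a stand-in assumption, not the network this tutorial tunes) and times it with the same ``time_evaluator("run", ...)`` call:

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # Toy network: one conv + relu, standing in for the tuned model.
    x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
    w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
    y = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))
    mod = tvm.IRModule.from_expr(relay.Function([x, w], y))
    params = {"w": np.random.uniform(size=(16, 3, 3, 3)).astype("float32")}

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)

    dev = tvm.device("llvm", 0)
    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input("x", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))

    # Repeat the whole-graph "run" function and report distribution statistics.
    ftimer = module.module.time_evaluator("run", dev, repeat=3, min_repeat_ms=500)
    prof_res = np.array(ftimer().results) * 1e3  # per-repeat latency in ms
    print("mean %.4f ms  median %.4f ms  std %.4f ms"
          % (np.mean(prof_res), np.median(prof_res), np.std(prof_res)))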
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
index b07d60d3c..bcaab3f65 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
@@ -633,7 +633,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      779.4330     778.2617     785.7413     774.2961      4.7454   
+      760.7450     763.7720     763.9236     754.5393      4.3885   
                
 
 
@@ -658,7 +658,7 @@ Other Tips
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  20.977 seconds)
+   **Total running time of the script:** ( 1 minutes  19.068 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
index 931e9e548..a7f257bf5 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
@@ -362,73 +362,29 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
                  placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
                  compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
       buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute} {
-      for (i0.outer.i1.outer.fused: int32, 0, 256) "parallel" {
-        allocate(compute_3: Pointer(global float32), float32, [256]), storage_scope = global {
-          for (i.inner.init: int32, 0, 16) {
-            let cse_var_1: int32 = (i.inner.init*16)
-             {
-              compute_4: Buffer(compute_3, float32, [256], [])[cse_var_1] = 0f32
-              compute_4[(cse_var_1 + 1)] = 0f32
-              compute_4[(cse_var_1 + 2)] = 0f32
-              compute_4[(cse_var_1 + 3)] = 0f32
-              compute_4[(cse_var_1 + 4)] = 0f32
-              compute_4[(cse_var_1 + 5)] = 0f32
-              compute_4[(cse_var_1 + 6)] = 0f32
-              compute_4[(cse_var_1 + 7)] = 0f32
-              compute_4[(cse_var_1 + 8)] = 0f32
-              compute_4[(cse_var_1 + 9)] = 0f32
-              compute_4[(cse_var_1 + 10)] = 0f32
-              compute_4[(cse_var_1 + 11)] = 0f32
-              compute_4[(cse_var_1 + 12)] = 0f32
-              compute_4[(cse_var_1 + 13)] = 0f32
-              compute_4[(cse_var_1 + 14)] = 0f32
-              compute_4[(cse_var_1 + 15)] = 0f32
-            }
-          }
-          for (elem_idx: int32, 0, let cse_var_2: int32 = floormod(i0.outer.i1.outer.fused, 32) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
-            for (i.inner: int32, 0, 16) {
-              let cse_var_21: int32 = floormod(i0.outer.i1.outer.fused, 32)
-              let cse_var_20: int32 = (i.inner*16)
-              let cse_var_19: int32 = (elem_idx*16)
-              let cse_var_18: int32 = (cse_var_20 + 10)
-              let cse_var_17: int32 = (cse_var_20 + 11)
-              let cse_var_16: int32 = (cse_var_20 + 12)
-              let cse_var_15: int32 = (cse_var_20 + 13)
-              let cse_var_14: int32 = (cse_var_20 + 14)
-              let cse_var_13: int32 = (cse_var_20 + 15)
-              let cse_var_12: int32 = (cse_var_20 + 2)
-              let cse_var_11: int32 = (cse_var_20 + 3)
-              let cse_var_10: int32 = (cse_var_20 + 4)
-              let cse_var_9: int32 = (cse_var_20 + 5)
-              let cse_var_8: int32 = (cse_var_20 + 6)
-              let cse_var_7: int32 = (cse_var_20 + 7)
-              let cse_var_6: int32 = (cse_var_20 + 8)
-              let cse_var_5: int32 = (cse_var_20 + 9)
-              let cse_var_4: int32 = (cse_var_20 + 1)
-              let cse_var_3: int32 = ((floordiv(i0.outer.i1.outer.fused, 32)*4096) + (i.inner*256))
-               {
-                compute_4[cse_var_20] = (compute_4[cse_var_20] + (placeholder_1[((placeholder_3[cse_var_21]*16) + cse_var_19)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_4] = (compute_4[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 1)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_12] = (compute_4[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 2)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_11] = (compute_4[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 3)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_10] = (compute_4[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 4)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_9] = (compute_4[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 5)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_8] = (compute_4[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 6)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_7] = (compute_4[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 7)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_6] = (compute_4[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 8)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_5] = (compute_4[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 9)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_18] = (compute_4[cse_var_18] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 10)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_17] = (compute_4[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 11)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_16] = (compute_4[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 12)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_15] = (compute_4[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 13)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_14] = (compute_4[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 14)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-                compute_4[cse_var_13] = (compute_4[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 15)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
+      for (i0.outer.i1.outer.fused: int32, 0, 16) "parallel" {
+        allocate(compute_3: Pointer(global float32), float32, [4096]), storage_scope = global {
+          for (i.outer.inner: int32, 0, 8) {
+            for (nb_j.inner: int32, 0, 2) {
+              for (i.inner.init: int32, 0, 16) {
+                for (j.init: int32, 0, 16) {
+                  compute_4: Buffer(compute_3, float32, [4096], [])[((((i.outer.inner*512) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
+                }
+              }
+              for (elem_idx: int32, 0, let cse_var_1: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
+                for (i.inner: int32, 0, 16) {
+                  for (j: int32, 0, 16) {
+                    let cse_var_3: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
+                    let cse_var_2: int32 = ((((i.outer.inner*512) + (i.inner*32)) + (nb_j.inner*16)) + j)
+                    compute_4[cse_var_2] = (compute_4[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[(((i.outer.inner*4096) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+                  }
+                }
               }
             }
           }
-          for (i0.inner: int32, 0, 16) {
-            let cse_var_22: int32 = (((floordiv(i0.outer.i1.outer.fused, 32)*8192) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 32)*16))
-            compute[ramp(cse_var_22, 1, 16)] = max((compute_4[ramp((i0.inner*16), 1, 16)] + placeholder_4[ramp(cse_var_22, 1, 16)]), broadcast(0f32, 16))
+          for (i0.inner: int32, 0, 128) {
+            let cse_var_4: int32 = ((i0.inner*512) + (i0.outer.i1.outer.fused*32))
+            compute[ramp(cse_var_4, 1, 32)] = max((compute_4[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_4, 1, 32)]), broadcast(0f32, 32))
           }
         }
       }
@@ -482,7 +438,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 1.919 ms
+    Execution time of this operator: 1.497 ms
 
 
 
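To make the computation in the TIR above concrete: it is a block-sparse matmul with a relu fused onto the input (the ``max(placeholder[...], 0f32)`` gather), a bias add, and a final relu. Below is a numpy/scipy reference under shapes read off the printed buffers (M=128 rows, N=512 output columns, K=256 reduction, 16x1 BSR blocks); the density value is purely illustrative:

.. code-block:: python

    import numpy as np
    import scipy.sparse as sp

    # Shapes inferred from the TIR buffers above; illustrative only.
    M, N, K, BS_R, BS_C = 128, 512, 256, 16, 1
    X = np.random.uniform(size=(M, K)).astype("float32")
    B = np.random.uniform(size=(M, N)).astype("float32")
    # Random weight stored as 16x1 block-sparse (BSR) blocks.
    W = sp.random(N, K, density=0.15, format="csr", dtype="float32").tobsr(blocksize=(BS_R, BS_C))

    # relu(X) @ W.T + bias, then relu -- the chain the tuned kernel computes.
    Y = np.maximum(np.maximum(X, 0) @ W.toarray().T + B, 0)
    print(Y.shape)  # (128, 512)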
diff --git a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
index 16ca3d41b..62eb2d307 100644
--- a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:43.833** total execution time for **how_to_tune_with_autotvm** files:
+**00:43.875** total execution time for **how_to_tune_with_autotvm** files:
 
-- **00:42.987**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
-- **00:00.221**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
-- **00:00.210**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
-- **00:00.207**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
-- **00:00.207**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
+- **00:43.044**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
+- **00:00.220**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
+- **00:00.205**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
+- **00:00.204**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
+- **00:00.202**: :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
index e0d91042c..7519dbecb 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
@@ -859,8 +859,8 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 4, 32]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2885496
-    No: 6   GFLOPS: 63.27/63.27     result: MeasureResult(costs=(0.0036591528999999996,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.580784797668457, timestamp=1650045397.8798823)       [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
-    No: 7   GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 6   GFLOPS: 42.32/42.32     result: MeasureResult(costs=(0.005470547157894737,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.5707504749298096, timestamp=1650045818.0508544)       [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
+    No: 7   GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -983,7 +983,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 16, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6225319
-    No: 8   GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 8   GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1106,7 +1106,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 8, 64]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,943546
-    No: 9   GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 9   GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1229,7 +1229,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 16, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2868708
-    No: 10  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 10  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
         res = future.result()
       File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
@@ -1247,7 +1247,7 @@ for this template
     TimeoutError
 
             [('tile_f', [-1, 32, 2, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4691833
-    No: 11  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 11  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1370,7 +1370,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1042124
-    No: 12  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 12  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1493,7 +1493,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 32, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10013405
-    No: 13  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 13  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1616,7 +1616,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 8, 8, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6732082
-    No: 14  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 14  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1739,7 +1739,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 4, 32]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7536735
-    No: 15  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 15  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1862,7 +1862,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 128, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,482121
-    No: 16  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 16  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -1985,7 +1985,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 16]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2824525
-    No: 17  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 17  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2108,7 +2108,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4559286
-    No: 18  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 18  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 571, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 523, in _build_func_common
@@ -2231,7 +2231,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 854, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 32, 16]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 512]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9677544
-    No: 19  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+    No: 19  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 721, in __call__
         yield remote, remote.load_module(os.path.split(build_result.filename)[1])
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 685, in run_through_rpc
@@ -2319,7 +2319,7 @@ for this template
       15: _PyEval_EvalFrameDefault
       14: 0x0000000000537c30
       13: _PyObject_FastCallKeywords
-      12: 0x00007f8505170fa2
+      12: 0x00007f0a9c06afa2
       11: _ctypes_callproc
       10: ffi_call
       9: ffi_call_unix64
@@ -2384,7 +2384,7 @@ for this template
       21: _PyFunction_FastCallKeywords
       20: _PyEval_EvalFrameDefault
       19: _PyFunction_FastCall      [('tile_f', [-1, 8, 2, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6390073
-    No: 20  GFLOPS: 145.17/145.17   result: MeasureResult(costs=(0.00159474064,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4268317222595215, timestamp=1650045424.2071035)      [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
+    No: 20  GFLOPS: 144.59/144.59   result: MeasureResult(costs=(0.0016010948300000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4313409328460693, timestamp=1650045844.3790898)      [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
 
 
 
@@ -2437,7 +2437,7 @@ and measure running time.
 
     Best config:
     [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
-    Time cost of this operator: 0.001982
+    Time cost of this operator: 0.001960
 
 
 
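For reference, the "Best config" record above is consumed through AutoTVM's dispatch context when the operator is compiled. A minimal sketch (not part of the diff), where `conv2d_template` stands in for the @autotvm.template function defined earlier in that tutorial and "conv2d.log" for the tuning log it produced:

.. code-block:: python

    import tvm
    from tvm import autotvm

    # Hypothetical names: conv2d_template is the tuning template and
    # "conv2d.log" the record file written during the search above.
    with autotvm.apply_history_best("conv2d.log"):
        with tvm.target.Target("cuda"):
            s, arg_bufs = conv2d_template(1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1))
            func = tvm.build(s, arg_bufs)
    # func now embeds the winning tile_f/tile_y/... configuration; timing it
    # with a time_evaluator yields the "Time cost of this operator" line.
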
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
index f643eeb0d..3fcd02494 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
@@ -292,10 +292,10 @@ Timing the untuned program
     ########## Build without Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  
     ---------                                     ---                                           --------  -------  -----              ------  -------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  313.6     98.771   (1, 2, 10, 10, 3)  2       1        
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.0       0.945    (1, 6, 10, 10)     1       1        
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.901     0.284    (1, 1, 10, 10, 3)  1       1        
-    Total_time                                    -                                             317.501   -        -                  -       -        
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  334.0     98.794   (1, 2, 10, 10, 3)  2       1        
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.131     0.926    (1, 6, 10, 10)     1       1        
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.944     0.279    (1, 1, 10, 10, 3)  1       1        
+    Total_time                                    -                                             338.076   -        -                  -       -        
 
 
 
@@ -357,10 +357,10 @@ Timing the tuned program
     ########## Build with Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  
     ---------                                     ---                                           --------  -------  -----              ------  -------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  81.2      96.83    (1, 6, 10, 10, 1)  2       1        
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.738     2.072    (1, 6, 10, 10)     1       1        
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.92      1.097    (1, 1, 10, 10, 3)  1       1        
-    Total_time                                    -                                             83.858    -        -                  -       -        
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  81.3      96.795   (1, 6, 10, 10, 1)  2       1        
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.768     2.105    (1, 6, 10, 10)     1       1        
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.924     1.1      (1, 1, 10, 10, 3)  1       1        
+    Total_time                                    -                                             83.992    -        -                  -       -        
 
 
 
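The per-operator tables above come from running the model under a profiling executor that times each fused kernel. A host-side sketch of the same idea (the microTVM tutorial gathers its table through a device session instead), using only a toy conv2d:

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib.debugger import debug_executor

    x = relay.var("x", shape=(1, 3, 10, 10), dtype="float32")
    w = relay.var("w", shape=(6, 3, 5, 5), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.nn.conv2d(x, w, padding=(2, 2)))
    lib = relay.build(mod, target="llvm",
                      params={"w": np.ones((6, 3, 5, 5), "float32")})
    dev = tvm.cpu(0)
    m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev)
    m.set_input("x", np.random.rand(1, 3, 10, 10).astype("float32"))
    m.run()  # prints a Node Name / Time(us) / Time(%) table like the ones above
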
diff --git a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
index 6e248e116..71b167350 100644
--- a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:44.076** total execution time for **how_to_work_with_microtvm** files:
+**00:43.541** total execution time for **how_to_work_with_microtvm** files:
 
-- **00:40.085**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)
-- **00:03.422**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)
-- **00:00.194**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_ethosu.py` (``micro_ethosu.py``)
-- **00:00.194**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tvmc.py` (``micro_tvmc.py``)
+- **00:39.586**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)
+- **00:03.405**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)
+- **00:00.187**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_ethosu.py` (``micro_ethosu.py``)
+- **00:00.182**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_tvmc.py` (``micro_tvmc.py``)
 - **00:00.181**: :ref:`sphx_glr_how_to_work_with_microtvm_micro_reference_vm.py` (``micro_reference_vm.py``)
diff --git a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
index 35c48c07d..38f9e3a33 100644
--- a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
@@ -5,8 +5,8 @@
 
 Computation times
 =================
-**00:08.594** total execution time for **how_to_work_with_relay** files:
+**00:08.670** total execution time for **how_to_work_with_relay** files:
 
-- **00:06.823**: :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``)
-- **00:01.563**: :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)
-- **00:00.208**: :ref:`sphx_glr_how_to_work_with_relay_using_relay_viz.py` (``using_relay_viz.py``)
+- **00:06.816**: :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``)
+- **00:01.650**: :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)
+- **00:00.204**: :ref:`sphx_glr_how_to_work_with_relay_using_relay_viz.py` (``using_relay_viz.py``)
diff --git a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
index 984af3b9f..5b80e77f1 100644
--- a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
@@ -5,13 +5,13 @@
 
 Computation times
 =================
-**00:05.519** total execution time for **how_to_work_with_schedules** files:
+**00:05.359** total execution time for **how_to_work_with_schedules** files:
 
-- **00:02.027**: :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)
-- **00:01.104**: :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)
-- **00:00.711**: :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)
-- **00:00.693**: :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)
-- **00:00.307**: :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)
-- **00:00.233**: :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)
-- **00:00.227**: :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``)
-- **00:00.217**: :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)
+- **00:02.004**: :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)
+- **00:01.049**: :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)
+- **00:00.694**: :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)
+- **00:00.675**: :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)
+- **00:00.292**: :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)
+- **00:00.222**: :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``)
+- **00:00.218**: :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)
+- **00:00.203**: :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)
diff --git a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
index 25f94327a..7b797747b 100644
--- a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
@@ -314,7 +314,7 @@ The import needs to happen before the tensorized GEMV is executed.
                  B: Buffer(B_2: Pointer(float32), float32, [32768], []),
                  C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
-      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpm85nq5e4/input0.cc'\nsource_filename = \"/tmp/tmpm85nq5e4/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = alloca float*, align 8\n  %8 = alloca float*, align 8\n  %9 = alloca floa [...]
+      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpra43z4sb/input0.cc'\nsource_filename = \"/tmp/tmpra43z4sb/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = alloca float*, align 8\n  %8 = alloca float*, align 8\n  %9 = alloca floa [...]
       for (i, 0, 1024) {
         for (j.outer: int32, 0, 32) {
           @tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
diff --git a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
index 773b0dcdb..9607845c3 100644
--- a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**00:21.331** total execution time for **topic_vta_tutorials_autotvm** files:
+**00:20.124** total execution time for **topic_vta_tutorials_autotvm** files:
 
-- **00:21.126**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``)
-- **00:00.205**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_alu_vta.py` (``tune_alu_vta.py``)
+- **00:19.933**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``)
+- **00:00.191**: :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_alu_vta.py` (``tune_alu_vta.py``)
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
index 555f84014..648a2fbdd 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
@@ -265,7 +265,7 @@ The compilation steps are:
       DeprecationWarning,
     /workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the  new recommended usage.
       relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
-    resnet18_v1 inference graph built in 22.20s!
+    resnet18_v1 inference graph built in 21.12s!
 
 
 
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
index 0ffe60909..22fddda61 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
@@ -301,7 +301,7 @@ The compilation steps are:
 
     /workspace/python/tvm/relay/build_module.py:439: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
       DeprecationWarning,
-    yolov3-tiny inference graph built in 15.38s!
+    yolov3-tiny inference graph built in 14.80s!
 
 
 
diff --git a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
index 656ee2871..c62329930 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**01:30.190** total execution time for **topic_vta_tutorials_frontend** files:
+**01:27.702** total execution time for **topic_vta_tutorials_frontend** files:
 
-- **00:47.943**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)
-- **00:42.248**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``)
+- **00:46.683**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)
+- **00:41.019**: :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``)
diff --git a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
index 1797fcf84..215d6376e 100644
--- a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**00:03.503** total execution time for **topic_vta_tutorials_optimize** files:
+**00:03.486** total execution time for **topic_vta_tutorials_optimize** files:
 
-- **00:02.968**: :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)
-- **00:00.535**: :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``)
+- **00:02.967**: :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)
+- **00:00.518**: :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``)
diff --git a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
index 6fb7f400d..8bc1c5444 100644
--- a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**00:00.964** total execution time for **topic_vta_tutorials** files:
+**00:00.925** total execution time for **topic_vta_tutorials** files:
 
-- **00:00.490**: :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``)
-- **00:00.474**: :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``)
+- **00:00.463**: :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``)
+- **00:00.462**: :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``)
diff --git a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
index 22c965241..58becc111 100644
--- a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
+++ b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
@@ -305,7 +305,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 94.001 ms
+    Execution time of this operator: 93.880 ms
 
 
 
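The execution-time line above is produced by rebuilding with the best record from the auto-scheduler log and timing the result. A sketch, assuming `task` (the auto_scheduler.SearchTask), the "matmul.json" log filename, and the `a`, `b`, `c_tvm` device arrays come from earlier in that tutorial:

.. code-block:: python

    import numpy as np
    import tvm

    sch, args = task.apply_best("matmul.json")  # log filename is an assumption
    func = tvm.build(sch, args, target="llvm")
    evaluator = func.time_evaluator(func.entry_name, tvm.cpu(), min_repeat_ms=500)
    print("Execution time of this operator: %.3f ms"
          % (np.median(evaluator(a, b, c_tvm).results) * 1000))
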
@@ -414,6 +414,11 @@ Expression (TE) language that demonstrates how TVM can optimize computational
 operations.
 
 
+.. rst-class:: sphx-glr-timing
+
+   **Total running time of the script:** ( 1 minutes  0.100 seconds)
+
+
 .. _sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py:
 
 
diff --git a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
index a5913f093..19cc01634 100644
--- a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
+++ b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
@@ -268,7 +268,7 @@ standard deviation.
 
  .. code-block:: none
 
-    {'mean': 497.8195587100008, 'median': 497.8889928000001, 'std': 1.2998905415882913}
+    {'mean': 490.83991727999893, 'median': 491.0158592500011, 'std': 0.36641131670863597}
 
 
 
@@ -482,31 +482,31 @@ the tuning data to.
 
  .. code-block:: none
 
-
    [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  1/25]  Current/Best:   16.09/  23.78 GFLOPS | Progress: (4/10) | 5.97 s
    [Task  1/25]  Current/Best:   12.38/  23.78 GFLOPS | Progress: (8/10) | 9.09 s
    [Task  1/25]  Current/Best:   12.87/  23.78 GFLOPS | Progress: (10/10) | 10.51 s Done.
-
    [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  2/25]  Current/Best:    6.93/  19.90 GFLOPS | Progress: (4/10) | 2.34 s
    [Task  2/25]  Current/Best:   12.59/  19.90 GFLOPS | Progress: (8/10) | 5.20 s
    [Task  2/25]  Current/Best:   13.76/  19.90 GFLOPS | Progress: (10/10) | 5.93 s Done.
-
    [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  3/25]  Current/Best:   18.47/  20.88 GFLOPS | Progress: (4/10) | 4.04 s
    [Task  3/25]  Current/Best:   12.51/  23.11 GFLOPS | Progress: (8/10) | 5.68 s
    [Task  3/25]  Current/Best:   12.42/  23.11 GFLOPS | Progress: (10/10) | 6.59 s Done.
-
    [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  4/25]  Current/Best:   15.14/  17.38 GFLOPS | Progress: (4/10) | 2.40 s
    [Task  4/25]  Current/Best:   12.29/  17.38 GFLOPS | Progress: (8/10) | 4.69 s
    [Task  4/25]  Current/Best:   17.37/  17.38 GFLOPS | Progress: (10/10) | 5.50 s Done.
-
    [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  5/25]  Current/Best:   20.51/  20.51 GFLOPS | Progress: (4/10) | 3.74 s
    [Task  5/25]  Current/Best:   18.37/  22.59 GFLOPS | Progress: (8/10) | 5.51 s
    [Task  5/25]  Current/Best:    1.71/  22.59 GFLOPS | Progress: (10/10) | 7.02 s Done.
-
    [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  6/25]  Current/Best:   12.60/  12.60 GFLOPS | Progress: (4/10) | 4.01 s
    [Task  6/25]  Current/Best:   14.84/  18.65 GFLOPS | Progress: (8/10) | 6.45 s
    [Task  6/25]  Current/Best:   13.74/  18.65 GFLOPS | Progress: (10/10) | 8.08 s Done.
-
    [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  7/25]  Current/Best:   15.90/  16.23 GFLOPS | Progress: (4/10) | 3.29 s
    [Task  7/25]  Current/Best:   17.09/  17.09 GFLOPS | Progress: (8/10) | 5.05 s
    [Task  7/25]  Current/Best:   15.77/  17.09 GFLOPS | Progress: (10/10) | 6.06 s Done.
-
    [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  8/25]  Current/Best:   22.54/  22.54 GFLOPS | Progress: (4/10) | 7.65 s
    [Task  8/25]  Current/Best:    3.82/  22.54 GFLOPS | Progress: (8/10) | 11.81 s
    [Task  8/25]  Current/Best:   10.26/  22.54 GFLOPS | Progress: (10/10) | 17.23 s Done.
-
    [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  9/25]  Current/Best:   18.40/  18.40 GFLOPS | Progress: (4/10) | 2.35 s
    [Task  9/25]  Current/Best:   21.66/  21.66 GFLOPS | Progress: (8/10) | 4.34 s
    [Task  9/25]  Current/Best:   17.37/  21.66 GFLOPS | Progress: (10/10) | 5.07 s Done.
-
    [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 10/25]  Current/Best:    5.55/  12.46 GFLOPS | Progress: (4/10) | 3.41 s
    [Task 10/25]  Current/Best:   17.16/  21.07 GFLOPS | Progress: (8/10) | 5.97 s
    [Task 10/25]  Current/Best:   18.13/  21.07 GFLOPS | Progress: (10/10) | 6.62 s Done.
-
    [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 11/25]  Current/Best:   24.34/  24.34 GFLOPS | Progress: (4/10) | 2.87 s
    [Task 11/25]  Current/Best:   12.13/  24.34 GFLOPS | Progress: (8/10) | 5.40 s
    [Task 11/25]  Current/Best:   14.89/  24.34 GFLOPS | Progress: (10/10) | 6.54 s Done.
-
    [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 12/25]  Current/Best:   15.22/  21.11 GFLOPS | Progress: (4/10) | 2.89 s
    [Task 12/25]  Current/Best:    4.69/  21.11 GFLOPS | Progress: (8/10) | 4.87 s
    [Task 12/25]  Current/Best:   15.17/  21.11 GFLOPS | Progress: (10/10) | 5.81 s Done.
-
    [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 13/25]  Current/Best:   19.04/  20.57 GFLOPS | Progress: (4/10) | 3.67 s
    [Task 13/25]  Current/Best:   18.20/  21.00 GFLOPS | Progress: (8/10) | 7.77 s
    [Task 13/25]  Current/Best:   17.24/  21.00 GFLOPS | Progress: (10/10) | 8.66 s Done.
-
    [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 14/25]  Current/Best:   18.08/  18.08 GFLOPS | Progress: (4/10) | 2.89 s
    [Task 14/25]  Current/Best:   13.93/  18.08 GFLOPS | Progress: (8/10) | 5.38 s
    [Task 14/25]  Current/Best:   11.46/  18.08 GFLOPS | Progress: (10/10) | 6.40 s Done.
-
    [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 15/25]  Current/Best:    9.73/  18.08 GFLOPS | Progress: (4/10) | 3.34 s
    [Task 15/25]  Current/Best:   15.40/  19.10 GFLOPS | Progress: (8/10) | 7.58 s
    [Task 15/25]  Current/Best:    9.29/  19.10 GFLOPS | Progress: (10/10) | 10.29 s
    [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 16/25]  Current/Best:   10.71/  13.77 GFLOPS | Progress: (4/10) | 2.45 s
    [Task 16/25]  Current/Best:   21.27/  21.27 GFLOPS | Progress: (8/10) | 5.18 s
    [Task 16/25]  Current/Best:   10.29/  21.27 GFLOPS | Progress: (10/10) | 5.89 s Done.
-
    [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 17/25]  Current/Best:    6.02/  13.89 GFLOPS | Progress: (4/10) | 3.72 s
    [Task 17/25]  Current/Best:    9.22/  21.85 GFLOPS | Progress: (8/10) | 5.84 s
    [Task 17/25]  Current/Best:   18.83/  21.85 GFLOPS | Progress: (10/10) | 6.80 s Done.
-
    [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 18/25]  Current/Best:   11.29/  17.49 GFLOPS | Progress: (4/10) | 8.26 s
    [Task 18/25]  Current/Best:    9.38/  22.13 GFLOPS | Progress: (8/10) | 11.46 s
    [Task 18/25]  Current/Best:   10.35/  22.13 GFLOPS | Progress: (10/10) | 15.54 s Done.
-
    [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 19/25]  Current/Best:   10.32/  22.57 GFLOPS | Progress: (4/10) | 4.06 s
    [Task 19/25]  Current/Best:   15.08/  23.47 GFLOPS | Progress: (8/10) | 7.38 s
    [Task 19/25]  Current/Best:   12.62/  23.47 GFLOPS | Progress: (10/10) | 9.03 s Done.
-
    [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 20/25]  Current/Best:   12.73/  12.73 GFLOPS | Progress: (4/10) | 3.09 s Done.
-
    [Task 20/25]  Current/Best:   12.74/  12.74 GFLOPS | Progress: (8/10) | 5.70 s
    [Task 20/25]  Current/Best:   10.48/  19.17 GFLOPS | Progress: (10/10) | 7.26 s Done.
-
    [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 21/25]  Current/Best:   10.65/  16.39 GFLOPS | Progress: (4/10) | 2.39 s
    [Task 21/25]  Current/Best:   15.40/  18.35 GFLOPS | Progress: (8/10) | 6.83 s
    [Task 21/25]  Current/Best:    0.00/  18.35 GFLOPS | Progress: (10/10) | 7.18 s
    [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 22/25]  Current/Best:   14.99/  16.78 GFLOPS | Progress: (4/10) | 2.97 s
    [Task 22/25]  Current/Best:    8.72/  20.07 GFLOPS | Progress: (8/10) | 4.90 s
    [Task 22/25]  Current/Best:    3.09/  20.07 GFLOPS | Progress: (10/10) | 5.87 s Done.
-
    [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 23/25]  Current/Best:    1.55/  22.45 GFLOPS | Progress: (4/10) | 5.43 s
    [Task 23/25]  Current/Best:   19.03/  22.45 GFLOPS | Progress: (8/10) | 7.33 s
    [Task 23/25]  Current/Best:   13.38/  22.45 GFLOPS | Progress: (10/10) | 8.25 s Done.
-
    [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 24/25]  Current/Best:    4.33/   4.33 GFLOPS | Progress: (4/10) | 218.91 s
    [Task 24/25]  Current/Best:    5.08/   8.58 GFLOPS | Progress: (8/10) | 231.82 s
    [Task 24/25]  Current/Best:    3.85/   8.58 GFLOPS | Progress: (10/10) | 234.78 s
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
+
    [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  1/25]  Current/Best:   17.07/  19.49 GFLOPS | Progress: (4/10) | 4.72 s
    [Task  1/25]  Current/Best:   24.09/  24.09 GFLOPS | Progress: (8/10) | 8.47 s
    [Task  1/25]  Current/Best:   12.84/  24.09 GFLOPS | Progress: (10/10) | 9.35 s Done.
+
    [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  2/25]  Current/Best:    7.20/  22.41 GFLOPS | Progress: (4/10) | 2.29 s
    [Task  2/25]  Current/Best:   18.39/  22.41 GFLOPS | Progress: (8/10) | 3.46 s
    [Task  2/25]  Current/Best:    9.43/  22.41 GFLOPS | Progress: (10/10) | 4.18 s Done.
+
    [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  3/25]  Current/Best:   12.44/  21.25 GFLOPS | Progress: (4/10) | 2.89 s
    [Task  3/25]  Current/Best:   17.48/  21.25 GFLOPS | Progress: (8/10) | 4.53 s
    [Task  3/25]  Current/Best:   18.40/  21.25 GFLOPS | Progress: (10/10) | 5.96 s Done.
+
    [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  4/25]  Current/Best:   12.30/  14.07 GFLOPS | Progress: (4/10) | 6.93 s
    [Task  4/25]  Current/Best:    9.65/  20.11 GFLOPS | Progress: (8/10) | 8.55 s
    [Task  4/25]  Current/Best:   14.18/  20.11 GFLOPS | Progress: (10/10) | 11.60 s Done.
+
    [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  5/25]  Current/Best:    9.64/  16.71 GFLOPS | Progress: (4/10) | 3.10 s
    [Task  5/25]  Current/Best:   13.63/  22.67 GFLOPS | Progress: (8/10) | 5.05 s
    [Task  5/25]  Current/Best:    6.13/  22.67 GFLOPS | Progress: (10/10) | 5.87 s Done.
+
    [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  6/25]  Current/Best:    9.21/  14.94 GFLOPS | Progress: (4/10) | 3.76 s
    [Task  6/25]  Current/Best:   19.55/  19.55 GFLOPS | Progress: (8/10) | 5.94 s
    [Task  6/25]  Current/Best:    5.31/  19.55 GFLOPS | Progress: (10/10) | 7.04 s Done.
+
    [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  7/25]  Current/Best:   15.88/  22.39 GFLOPS | Progress: (4/10) | 2.68 s
    [Task  7/25]  Current/Best:   15.74/  22.39 GFLOPS | Progress: (8/10) | 4.66 s
    [Task  7/25]  Current/Best:    1.59/  22.39 GFLOPS | Progress: (10/10) | 7.24 s Done.
+
    [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  8/25]  Current/Best:   15.11/  18.82 GFLOPS | Progress: (4/10) | 2.93 s
    [Task  8/25]  Current/Best:   17.09/  18.82 GFLOPS | Progress: (8/10) | 14.28 s
    [Task  8/25]  Current/Best:   15.52/  18.82 GFLOPS | Progress: (10/10) | 15.04 s
    [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task  9/25]  Current/Best:    9.87/  16.51 GFLOPS | Progress: (4/10) | 3.35 s
    [Task  9/25]  Current/Best:   21.53/  21.53 GFLOPS | Progress: (8/10) | 7.51 s
    [Task  9/25]  Current/Best:   12.17/  21.53 GFLOPS | Progress: (10/10) | 8.83 s Done.
+
    [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 10/25]  Current/Best:   18.70/  18.70 GFLOPS | Progress: (4/10) | 3.24 s
    [Task 10/25]  Current/Best:   16.25/  18.70 GFLOPS | Progress: (8/10) | 4.92 s
    [Task 10/25]  Current/Best:   14.80/  18.70 GFLOPS | Progress: (10/10) | 5.64 s Done.
+
    [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 11/25]  Current/Best:    8.03/  19.17 GFLOPS | Progress: (4/10) | 3.29 s
    [Task 11/25]  Current/Best:    7.67/  19.17 GFLOPS | Progress: (8/10) | 5.87 s
    [Task 11/25]  Current/Best:    9.02/  20.35 GFLOPS | Progress: (10/10) | 6.80 s Done.
+
    [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 12/25]  Current/Best:    7.71/  18.45 GFLOPS | Progress: (4/10) | 4.15 s
    [Task 12/25]  Current/Best:   16.12/  18.45 GFLOPS | Progress: (8/10) | 7.24 s
    [Task 12/25]  Current/Best:   17.12/  18.45 GFLOPS | Progress: (10/10) | 8.03 s Done.
+
    [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 13/25]  Current/Best:    9.17/  20.78 GFLOPS | Progress: (4/10) | 3.62 s
    [Task 13/25]  Current/Best:    8.79/  20.78 GFLOPS | Progress: (8/10) | 6.17 s
    [Task 13/25]  Current/Best:    7.50/  20.78 GFLOPS | Progress: (10/10) | 7.95 s Done.
+
    [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 14/25]  Current/Best:    3.10/  18.74 GFLOPS | Progress: (4/10) | 3.60 s
    [Task 14/25]  Current/Best:   13.76/  18.74 GFLOPS | Progress: (8/10) | 6.71 s
    [Task 14/25]  Current/Best:   10.73/  18.76 GFLOPS | Progress: (10/10) | 8.62 s Done.
+
    [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 15/25]  Current/Best:   14.88/  16.19 GFLOPS | Progress: (4/10) | 2.45 s
    [Task 15/25]  Current/Best:    9.79/  21.69 GFLOPS | Progress: (8/10) | 6.08 s Done.
+
    [Task 15/25]  Current/Best:   14.80/  21.69 GFLOPS | Progress: (10/10) | 7.29 s Done.
+
    [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 16/25]  Current/Best:   20.48/  20.48 GFLOPS | Progress: (4/10) | 2.18 s
    [Task 16/25]  Current/Best:   21.93/  21.93 GFLOPS | Progress: (8/10) | 4.88 s
    [Task 16/25]  Current/Best:   15.49/  22.39 GFLOPS | Progress: (10/10) | 5.41 s Done.
+
    [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 17/25]  Current/Best:    6.17/  18.53 GFLOPS | Progress: (4/10) | 2.81 s
    [Task 17/25]  Current/Best:    7.49/  23.47 GFLOPS | Progress: (8/10) | 5.82 s
    [Task 17/25]  Current/Best:   17.05/  23.47 GFLOPS | Progress: (10/10) | 6.56 s Done.
+
    [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 18/25]  Current/Best:   11.12/  18.88 GFLOPS | Progress: (4/10) | 4.12 s
    [Task 18/25]  Current/Best:   10.15/  18.88 GFLOPS | Progress: (8/10) | 8.68 s
    [Task 18/25]  Current/Best:   10.03/  18.88 GFLOPS | Progress: (10/10) | 10.33 s Done.
+
    [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 19/25]  Current/Best:   10.99/  17.38 GFLOPS | Progress: (4/10) | 4.27 s
    [Task 19/25]  Current/Best:    1.56/  17.38 GFLOPS | Progress: (8/10) | 10.67 s
    [Task 19/25]  Current/Best:    9.06/  17.38 GFLOPS | Progress: (10/10) | 14.57 s Done.
+
    [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 20/25]  Current/Best:    7.32/  14.24 GFLOPS | Progress: (4/10) | 3.25 s
    [Task 20/25]  Current/Best:   10.22/  16.95 GFLOPS | Progress: (8/10) | 6.16 s
    [Task 20/25]  Current/Best:    2.46/  16.95 GFLOPS | Progress: (10/10) | 8.04 s
    [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 21/25]  Current/Best:   21.60/  21.60 GFLOPS | Progress: (4/10) | 2.61 s
    [Task 21/25]  Current/Best:   14.47/  21.60 GFLOPS | Progress: (8/10) | 4.45 s
    [Task 21/25]  Current/Best:    8.91/  21.60 GFLOPS | Progress: (10/10) | 5.19 s Done.
+
    [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 22/25]  Current/Best:   17.82/  20.36 GFLOPS | Progress: (4/10) | 2.40 s
    [Task 22/25]  Current/Best:   10.62/  20.36 GFLOPS | Progress: (8/10) | 4.10 s
    [Task 22/25]  Current/Best:   12.31/  20.64 GFLOPS | Progress: (10/10) | 4.85 s Done.
+
    [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 23/25]  Current/Best:    7.76/  17.21 GFLOPS | Progress: (4/10) | 4.00 s
    [Task 23/25]  Current/Best:    7.13/  20.32 GFLOPS | Progress: (8/10) | 7.85 s
    [Task 23/25]  Current/Best:   18.28/  20.32 GFLOPS | Progress: (10/10) | 8.75 s Done.
+
    [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
    [Task 24/25]  Current/Best:   10.59/  10.59 GFLOPS | Progress: (4/10) | 12.92 s
    [Task 24/25]  Current/Best:    3.49/  10.59 GFLOPS | Progress: (8/10) | 15.93 s
    [Task 24/25]  Current/Best:    0.55/  10.59 GFLOPS | Progress: (10/10) | 27.61 s
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
      Done.
-
    [Task 25/25]  Current/Best:    6.19/   6.19 GFLOPS | Progress: (4/10) | 18.08 s
    [Task 25/25]  Current/Best:    1.51/   9.27 GFLOPS | Progress: (8/10) | 32.75 s
    [Task 25/25]  Current/Best:    1.55/   9.27 GFLOPS | Progress: (10/10) | 52.61 s
+
    [Task 25/25]  Current/Best:    9.18/   9.18 GFLOPS | Progress: (4/10) | 2.59 s
    [Task 25/25]  Current/Best:    9.65/   9.65 GFLOPS | Progress: (8/10) | 21.21 s
    [Task 25/25]  Current/Best:    5.06/   9.65 GFLOPS | Progress: (10/10) | 22.69 s
 
 
 The output from this tuning process will look something like this:
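
For reference, the per-task progress lines in this log come from the standard AutoTVM tuning loop. A sketch, assuming `tasks` was extracted with autotvm.task.extract_from_program and `measure_option` was configured as earlier in the tutorial:

.. code-block:: python

    from tvm import autotvm
    from tvm.autotvm.tuner import XGBTuner

    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner = XGBTuner(task, loss_type="rank")
        tuner.tune(
            n_trial=10,
            early_stopping=100,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(10, prefix=prefix),
                autotvm.callback.log_to_file("resnet-50-v2-autotuning.json"),
            ],
        )
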
@@ -564,14 +564,6 @@ model using optimized operators to speed up our computations.
 
 
 
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
-     Done.
-
 
 
 Verify that the optimized model runs and produces the same results:
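
A sketch of that verification step, assuming `module` is the GraphModule built from the tuned library and `img_data`/`labels` were prepared earlier; it produces the class/probability lines shown in the next hunk:

.. code-block:: python

    import numpy as np
    from scipy.special import softmax

    module.set_input("data", img_data)  # "data" is the model's input name
    module.run()
    scores = softmax(module.get_output(0).numpy()[0])
    for rank in scores.argsort()[::-1][:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
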
@@ -602,8 +594,8 @@ Verify that the optimized model runs and produces the same results:
 
  .. code-block:: none
 
-    class='n02123045 tabby, tabby cat' with probability=0.621105
-    class='n02123159 tiger cat' with probability=0.356377
+    class='n02123045 tabby, tabby cat' with probability=0.621104
+    class='n02123159 tiger cat' with probability=0.356379
     class='n02124075 Egyptian cat' with probability=0.019712
     class='n02129604 tiger, Panthera tigris' with probability=0.001215
     class='n04040759 radiator' with probability=0.000262
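
The mean/median/std dictionaries in this file (the unoptimized figures earlier and the optimized ones in the next hunk) are collected with a plain timeit loop. A sketch, assuming `module` is the compiled GraphModule:

.. code-block:: python

    import timeit
    import numpy as np

    timing_number = 10
    timing_repeat = 10
    runs = (
        np.array(timeit.Timer(lambda: module.run()).repeat(
            repeat=timing_repeat, number=timing_number))
        * 1000 / timing_number
    )  # milliseconds per run
    print({"mean": np.mean(runs), "median": np.median(runs), "std": np.std(runs)})
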
@@ -656,8 +648,8 @@ improvement in comparing the optimized model to the unoptimized model.
 
  .. code-block:: none
 
-    optimized: {'mean': 425.86808551000104, 'median': 425.5981838000025, 'std': 1.237847702770817}
-    unoptimized: {'mean': 497.8195587100008, 'median': 497.8889928000001, 'std': 1.2998905415882913}
+    optimized: {'mean': 422.83022096000195, 'median': 422.861491499998, 'std': 0.6089890616715313}
+    unoptimized: {'mean': 490.83991727999893, 'median': 491.0158592500011, 'std': 0.36641131670863597}
 
 
 
@@ -677,7 +669,7 @@ profiling/benchmarking.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 11 minutes  2.536 seconds)
+   **Total running time of the script:** ( 6 minutes  57.341 seconds)
 
 
 .. _sphx_glr_download_tutorial_autotvm_relay_x86.py:
diff --git a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
index 11a989b9d..0f472c03b 100644
--- a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
+++ b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
@@ -235,7 +235,7 @@ device and returns the measured cost. Network overhead is excluded.
 
  .. code-block:: none
 
-    1.277e-07 secs/op
+    1.242e-07 secs/op
 
 
 
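The secs/op figure above comes from TVM's time_evaluator, which the tutorial invokes on a module uploaded over RPC; the same call works on a local device. A self-contained local sketch:

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import te

    n = 1024
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)
    func = tvm.build(s, [A, B], target="llvm")
    dev = tvm.device("llvm", 0)
    a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), dev)
    b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
    # Runs the function `number` times and reports the mean wall-clock cost.
    time_f = func.time_evaluator(func.entry_name, dev, number=10)
    print("%g secs/op" % time_f(a, b).mean)
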
diff --git a/docs/_sources/tutorial/intro_topi.rst.txt b/docs/_sources/tutorial/intro_topi.rst.txt
index 132acb5c9..60e09b659 100644
--- a/docs/_sources/tutorial/intro_topi.rst.txt
+++ b/docs/_sources/tutorial/intro_topi.rst.txt
@@ -230,7 +230,7 @@ As you can see, scheduled stages of computation have been accumulated and we can
 
  .. code-block:: none
 
-    [stage(a, placeholder(a, 0x1662a780)), stage(b, placeholder(b, 0x21590010)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(mi [...]
+    [stage(a, placeholder(a, 0xf2872e0)), stage(b, placeholder(b, 0xd08cf10)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min= [...]
 
 
 
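The stage listing above is what the schedule object accumulates as TOPI operators are composed. A CPU-flavored sketch that reproduces the same kind of listing:

.. code-block:: python

    import tvm
    from tvm import te, topi

    a = te.placeholder((100, 10, 10), name="a")
    b = te.placeholder((10, 10), name="b")
    c = topi.add(a, b)       # broadcast add -> stage(T_add, ...)
    d = topi.multiply(a, b)  # broadcast mul -> stage(T_multiply, ...)
    sg = te.create_schedule([c.op, d.op])
    print(sg.stages)
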
diff --git a/docs/_sources/tutorial/sg_execution_times.rst.txt b/docs/_sources/tutorial/sg_execution_times.rst.txt
index 27cd78b6e..63be73f70 100644
--- a/docs/_sources/tutorial/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorial/sg_execution_times.rst.txt
@@ -5,17 +5,17 @@
 
 Computation times
 =================
-**13:31.365** total execution time for **tutorial** files:
+**09:49.170** total execution time for **tutorial** files:
 
-- **11:02.536**: :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)
-- **01:01.157**: :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)
-- **00:42.059**: :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``)
-- **00:26.595**: :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)
-- **00:16.821**: :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)
-- **00:01.050**: :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)
-- **00:00.730**: :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)
-- **00:00.229**: :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``)
-- **00:00.050**: :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)
-- **00:00.049**: :ref:`sphx_glr_tutorial_tvmc_python.py` (``tvmc_python.py``)
-- **00:00.045**: :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)
-- **00:00.043**: :ref:`sphx_glr_tutorial_install.py` (``install.py``)
+- **06:57.341**: :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)
+- **01:00.100**: :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``)
+- **00:59.103**: :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)
+- **00:25.831**: :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)
+- **00:24.661**: :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)
+- **00:01.101**: :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)
+- **00:00.704**: :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)
+- **00:00.198**: :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``)
+- **00:00.039**: :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)
+- **00:00.031**: :ref:`sphx_glr_tutorial_tvmc_python.py` (``tvmc_python.py``)
+- **00:00.031**: :ref:`sphx_glr_tutorial_install.py` (``install.py``)
+- **00:00.031**: :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)
diff --git a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
index 9bd66d5ea..a3de08215 100644
--- a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
+++ b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
@@ -243,7 +243,7 @@ helper function to run a profile of the TVM generated code.
 
  .. code-block:: none
 
-    Numpy running time: 0.000007
+    Numpy running time: 0.000009
     naive: 0.000006
 
 
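The "naive" figure above times the TE vector add built with the default schedule. A self-contained sketch of that baseline:

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
    s = te.create_schedule(C.op)  # default schedule: one serial loop
    fadd = tvm.build(s, [A, B, C], "llvm", name="myadd")

    dev = tvm.device("llvm", 0)
    size = 32768
    a = tvm.nd.array(np.random.uniform(size=size).astype("float32"), dev)
    b = tvm.nd.array(np.random.uniform(size=size).astype("float32"), dev)
    c = tvm.nd.array(np.zeros(size, dtype="float32"), dev)
    fadd(a, b, c)
    np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())
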
@@ -387,7 +387,7 @@ factor to be the number of threads on your CPU.
 
  .. code-block:: none
 
-    vector: 0.000025
+    vector: 0.000027
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type="auto"),
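
The lowered function above is the vectorized variant: split the loop by a SIMD-friendly factor, parallelize the outer loop, and vectorize the inner one. A sketch:

.. code-block:: python

    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
    s = te.create_schedule(C.op)
    outer, inner = s[C].split(C.op.axis[0], factor=4)
    s[C].parallel(outer)
    s[C].vectorize(inner)
    print(tvm.lower(s, [A, B, C], simple_mode=True))
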
@@ -436,10 +436,10 @@ We can now compare the different schedules
  .. code-block:: none
 
                 Operator                  Timing             Performance
-                   numpy    7.338310001614445e-06                    1.0
-                   naive              5.9001e-06      0.8040134579626596
-                parallel              6.0426e-06      0.8234320979449784
-                  vector             2.46126e-05       3.353987497746098
+                   numpy    9.022220000360903e-06                    1.0
+                   naive    5.8499000000000005e-06    0.6483880907100464
+                parallel              6.0838e-06      0.6743129739417392
+                  vector    2.6628699999999998e-05    2.9514576233936665
 
 
 
@@ -828,7 +828,7 @@ matrix multiplication.
 
  .. code-block:: none
 
-    Numpy running time: 0.018640
+    Numpy running time: 0.018395
 
 
 
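The numpy figure above is the 1024x1024 GEMM baseline against which every schedule below is normalized. A sketch of that measurement:

.. code-block:: python

    import timeit
    import numpy as np

    M = K = N = 1024
    a = np.random.rand(M, K).astype("float32")
    b = np.random.rand(K, N).astype("float32")
    t = timeit.timeit(lambda: np.dot(a, b), number=10) / 10
    print("Numpy running time: %f" % t)
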
@@ -884,7 +884,7 @@ optimizations.
 
  .. code-block:: none
 
-    none: 3.393963
+    none: 3.289817
 
 
 
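The "none" figure is the same GEMM expressed in tensor expressions and compiled with the default, fully serial schedule. A sketch of that definition, which the schedules below all start from:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    s = te.create_schedule(C.op)
    func = tvm.build(s, [A, B, C], target="llvm", name="mmult")
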
@@ -982,7 +982,7 @@ schedule.
 
  .. code-block:: none
 
-    blocking: 0.326739
+    blocking: 0.294380
 
 
 
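The "blocking" figure comes from tiling the output into 32x32 blocks and splitting the reduction axis so each block stays cache resident. A sketch in the tutorial's shape:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    s = te.create_schedule(C.op)
    bn = 32
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)
    s[C].reorder(xo, yo, ko, ki, xi, yi)
    func = tvm.build(s, [A, B, C], target="llvm")
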
@@ -1073,7 +1073,7 @@ already cache friendly from our previous optimizations.
 
  .. code-block:: none
 
-    vectorization: 0.342312
+    vectorization: 0.330446
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
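
"vectorization" adds a single scheduling call on top of the blocked nest: the innermost 32-wide output axis becomes a vector operation. A sketch:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    s = te.create_schedule(C.op)
    bn = 32
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)
    s[C].reorder(xo, yo, ko, ki, xi, yi)
    s[C].vectorize(yi)  # inner 32 elements become one vector op
    func = tvm.build(s, [A, B, C], target="llvm")
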
@@ -1144,7 +1144,7 @@ more cache friendly.
 
  .. code-block:: none
 
-    loop permutation: 0.121740
+    loop permutation: 0.114469
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
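
"loop permutation" hoists xi above ki so A is walked sequentially along the reduction axis inside each block. A sketch (only the reorder differs from the previous one):

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    s = te.create_schedule(C.op)
    bn = 32
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)
    s[C].reorder(xo, yo, ko, xi, ki, yi)  # xi now outside ki
    s[C].vectorize(yi)
    func = tvm.build(s, [A, B, C], target="llvm")
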
@@ -1240,7 +1240,7 @@ optimized schedule.
 
  .. code-block:: none
 
-    array packing: 0.110851
+    array packing: 0.108735
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
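
"array packing" repacks B so that each 32-wide output block reads from a contiguous slab of memory. A sketch:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    # B[k, n] -> packedB[n // bn, k, n % bn]: innermost dim is contiguous.
    packedB = te.compute((N // bn, K, bn),
                         lambda bigN, kk, littleN: B[kk, bigN * bn + littleN],
                         name="packedB")
    C = te.compute(
        (M, N),
        lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)],
                            axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)
    s[C].reorder(xo, yo, ko, xi, ki, yi)
    s[C].vectorize(yi)
    func = tvm.build(s, [A, B, C], target="llvm")
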
@@ -1330,7 +1330,7 @@ to `C` when all the block results are ready.
 
  .. code-block:: none
 
-    block caching: 0.111037
+    block caching: 0.110104
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
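
"block caching" accumulates each 32x32 block into a local write buffer and stores to C only when the whole block is done. A sketch (applied to the unpacked matmul here for brevity):

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    s = te.create_schedule(C.op)
    CC = s.cache_write(C, "global")  # local accumulation buffer
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[CC].compute_at(s[C], yo)       # compute the block inside the block loop
    mc, nc = s[CC].op.axis
    (kaxis,) = s[CC].op.reduce_axis
    ko, ki = s[CC].split(kaxis, factor=4)
    s[CC].reorder(ko, mc, ki, nc)
    s[CC].vectorize(nc)
    func = tvm.build(s, [A, B, C], target="llvm")
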
@@ -1413,7 +1413,7 @@ of thread-level parallelization.
 
  .. code-block:: none
 
-    parallelization: 0.144317
+    parallelization: 0.144070
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
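
"parallelization" distributes the outermost blocked loop across CPU threads. A sketch:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    s = te.create_schedule(C.op)
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[C].parallel(xo)  # one block row per thread
    func = tvm.build(s, [A, B, C], target="llvm")
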
@@ -1491,13 +1491,13 @@ working, we can compare the results.
  .. code-block:: none
 
                 Operator                  Timing             Performance
-                    none      3.3939625678999996                     1.0
-                blocking     0.32673893090000006     0.09627063480024425
-           vectorization            0.3423118515     0.10085905328997309
-        loop permutation     0.12173995230000001    0.035869562455229466
-           array packing     0.11085103339999999     0.03266124218588204
-           block caching     0.11103668380000001     0.03271594237667256
-         parallelization              0.14431674     0.04252160626783088
+                    none            3.2898165229                     1.0
+                blocking            0.2943796007     0.08948207252619123
+           vectorization     0.33044605889999995     0.10044513321633788
+        loop permutation     0.11446943409999999    0.034795081519954876
+           array packing     0.10873529960000002    0.033052086292079584
+           block caching     0.11010433149999999     0.03346822861809391
+         parallelization            0.1440697348     0.04379263518106516
 
 
 
@@ -1532,11 +1532,6 @@ operations with tunable parameters that allows you to automatically optimize
 the computation for specific platforms.
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  1.157 seconds)
-
-
 .. _sphx_glr_download_tutorial_tensor_expr_get_started.py:
 
 
diff --git a/docs/commit_hash b/docs/commit_hash
index 3ad585ddd..7c83b6136 100644
--- a/docs/commit_hash
+++ b/docs/commit_hash
@@ -1 +1 @@
-f238900e6b64db1c881cbfd8f56f77ed55e061e0
+8bfe3bbb3cc221a8e5d1063f72c1c193c6af5bd9
diff --git a/docs/how_to/compile_models/from_darknet.html b/docs/how_to/compile_models/from_darknet.html
index 1b84141e4..8b2ffd9c9 100644
--- a/docs/how_to/compile_models/from_darknet.html
+++ b/docs/how_to/compile_models/from_darknet.html
@@ -548,7 +548,6 @@ class:[&#39;truck 0.9266&#39;] left:471 right:83 top:689 bottom:169
 class:[&#39;bicycle 0.9984&#39;] left:111 right:113 top:577 bottom:447
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  0.170 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-darknet-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_darknet.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/from_mxnet.html b/docs/how_to/compile_models/from_mxnet.html
index cb14c8140..65bfe975a 100644
--- a/docs/how_to/compile_models/from_mxnet.html
+++ b/docs/how_to/compile_models/from_mxnet.html
@@ -400,7 +400,7 @@
 </div>
 <img alt="../../_images/sphx_glr_from_mxnet_001.png" class="sphx-glr-single-img" src="../../_images/sphx_glr_from_mxnet_001.png" />
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipa9c2d860-11ff-4171-bbe1-7e0abaa3ad4a from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipeed34e15-264c-4349-9328-89248cd4ff42 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
 x (1, 3, 224, 224)
 </pre></div>
 </div>
diff --git a/docs/how_to/compile_models/from_paddle.html b/docs/how_to/compile_models/from_paddle.html
index 9a21e1ac1..782521d5c 100644
--- a/docs/how_to/compile_models/from_paddle.html
+++ b/docs/how_to/compile_models/from_paddle.html
@@ -463,7 +463,7 @@ A quick solution is</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>TVM prediction top-1 id: 282, class name:  282: &#39;tiger cat&#39;,
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  5.922 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  4.311 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-paddle-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/16269b77359771348d507395692524cf/from_paddle.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_paddle.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/from_pytorch.html b/docs/how_to/compile_models/from_pytorch.html
index d53eaa791..7431cd8b0 100644
--- a/docs/how_to/compile_models/from_pytorch.html
+++ b/docs/how_to/compile_models/from_pytorch.html
@@ -386,9 +386,10 @@ be unstable.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/resnet18-f37072fd.pth&quot; to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
 
   0%|          | 0.00/44.7M [00:00&lt;?, ?B/s]
- 31%|###       | 13.8M/44.7M [00:00&lt;00:00, 145MB/s]
- 84%|########3 | 37.5M/44.7M [00:00&lt;00:00, 205MB/s]
-100%|##########| 44.7M/44.7M [00:00&lt;00:00, 204MB/s]
+ 12%|#2        | 5.53M/44.7M [00:00&lt;00:00, 58.0MB/s]
+ 25%|##4       | 11.1M/44.7M [00:00&lt;00:00, 55.1MB/s]
+ 76%|#######6  | 34.0M/44.7M [00:00&lt;00:00, 138MB/s]
+100%|##########| 44.7M/44.7M [00:00&lt;00:00, 133MB/s]
 </pre></div>
 </div>
 </div>
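
The progress-bar churn above only reflects download speed on the builder. For reference, the compile path behind this page is roughly the sketch below, assuming the public torch and tvm.relay APIs; the input name "input0" is a conventional placeholder of mine, not taken from this commit:

    import torch
    import torchvision
    from tvm import relay

    # Trace the same pretrained ResNet-18 the log above fetches.
    model = torchvision.models.resnet18(pretrained=True).eval()
    inp = torch.randn(1, 3, 224, 224)
    scripted = torch.jit.trace(model, inp)

    # The Relay frontend takes a list of (input_name, shape) pairs.
    mod, params = relay.frontend.from_pytorch(scripted, [("input0", inp.shape)])
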
diff --git a/docs/how_to/compile_models/from_tensorflow.html b/docs/how_to/compile_models/from_tensorflow.html
index ea5b5223a..b90a1b068 100644
--- a/docs/how_to/compile_models/from_tensorflow.html
+++ b/docs/how_to/compile_models/from_tensorflow.html
@@ -606,7 +606,6 @@ banana (score = 0.00022)
 desk (score = 0.00019)
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  4.616 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-tensorflow-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_tensorflow.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/sg_execution_times.html b/docs/how_to/compile_models/sg_execution_times.html
index 4dac6b78c..247bf4bcd 100644
--- a/docs/how_to/compile_models/sg_execution_times.html
+++ b/docs/how_to/compile_models/sg_execution_times.html
@@ -300,17 +300,17 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-compile-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>04:57.537</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
+<p><strong>04:42.004</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
 <ul class="simple">
-<li><p><strong>01:05.922</strong>: <a class="reference internal" href="from_paddle.html#sphx-glr-how-to-compile-models-from-paddle-py"><span class="std std-ref">Compile PaddlePaddle Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_paddle.py</span></code>)</p></li>
-<li><p><strong>01:04.616</strong>: <a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></li>
-<li><p><strong>01:00.170</strong>: <a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></li>
-<li><p><strong>00:25.787</strong>: <a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></li>
-<li><p><strong>00:22.201</strong>: <a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></li>
-<li><p><strong>00:21.740</strong>: <a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></li>
-<li><p><strong>00:19.544</strong>: <a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></li>
-<li><p><strong>00:14.715</strong>: <a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></li>
-<li><p><strong>00:02.843</strong>: <a class="reference internal" href="from_onnx.html#sphx-glr-how-to-compile-models-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></li>
+<li><p><strong>01:04.311</strong>: <a class="reference internal" href="from_paddle.html#sphx-glr-how-to-compile-models-from-paddle-py"><span class="std std-ref">Compile PaddlePaddle Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_paddle.py</span></code>)</p></li>
+<li><p><strong>00:58.956</strong>: <a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></li>
+<li><p><strong>00:56.850</strong>: <a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></li>
+<li><p><strong>00:25.312</strong>: <a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></li>
+<li><p><strong>00:21.831</strong>: <a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></li>
+<li><p><strong>00:20.843</strong>: <a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></li>
+<li><p><strong>00:18.767</strong>: <a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></li>
+<li><p><strong>00:12.671</strong>: <a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></li>
+<li><p><strong>00:02.462</strong>: <a class="reference internal" href="from_onnx.html#sphx-glr-how-to-compile-models-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/deploy_models/deploy_model_on_android.html b/docs/how_to/deploy_models/deploy_model_on_android.html
index 9a8f0a153..71df25ca0 100644
--- a/docs/how_to/deploy_models/deploy_model_on_android.html
+++ b/docs/how_to/deploy_models/deploy_model_on_android.html
@@ -622,7 +622,7 @@ to the remote android device.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  15.8510      15.8504      16.0153      15.6463       0.1026
+  15.7319      15.6878      16.1813      15.5590       0.1696
 </pre></div>
 </div>
 </div>
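
The summary table above (mean/median/max/min/std, all in milliseconds) comes from repeated remote runs. A minimal sketch of how such a table is produced, assuming `module` is a graph executor created over the tutorial's RPC session to the Android device and `dev` is the remote device handle (both set up earlier on that page):

    import numpy as np

    # Time the "run" entry point on the remote device; results are in seconds.
    ftimer = module.module.time_evaluator("run", dev, number=1, repeat=10)
    prof_res = np.array(ftimer().results) * 1000  # to milliseconds
    print("mean %.4f  median %.4f  std %.4f (ms)"
          % (np.mean(prof_res), np.median(prof_res), np.std(prof_res)))
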
diff --git a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
index a1e388f36..dbe761125 100644
--- a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
+++ b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
@@ -409,40 +409,14 @@ be unstable.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth&quot; to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
 
   0%|          | 0.00/170M [00:00&lt;?, ?B/s]
-  3%|2         | 4.36M/170M [00:00&lt;00:03, 45.7MB/s]
-  5%|5         | 9.01M/170M [00:00&lt;00:03, 47.4MB/s]
-  9%|8         | 14.9M/170M [00:00&lt;00:03, 53.7MB/s]
- 12%|#1        | 20.0M/170M [00:00&lt;00:02, 53.7MB/s]
- 15%|#4        | 25.2M/170M [00:00&lt;00:02, 53.9MB/s]
- 18%|#7        | 30.3M/170M [00:00&lt;00:02, 51.9MB/s]
- 21%|##        | 35.3M/170M [00:00&lt;00:02, 51.1MB/s]
- 24%|##3       | 40.2M/170M [00:00&lt;00:02, 50.2MB/s]
- 27%|##6       | 45.1M/170M [00:00&lt;00:02, 50.4MB/s]
- 29%|##9       | 49.9M/170M [00:01&lt;00:02, 50.0MB/s]
- 32%|###2      | 54.7M/170M [00:01&lt;00:02, 47.5MB/s]
- 35%|###4      | 59.2M/170M [00:01&lt;00:02, 47.4MB/s]
- 38%|###8      | 64.8M/170M [00:01&lt;00:02, 50.6MB/s]
- 41%|####1     | 70.1M/170M [00:01&lt;00:02, 52.1MB/s]
- 44%|####4     | 75.1M/170M [00:01&lt;00:02, 45.3MB/s]
- 48%|####7     | 80.9M/170M [00:01&lt;00:01, 49.5MB/s]
- 51%|#####     | 85.8M/170M [00:01&lt;00:01, 49.5MB/s]
- 54%|#####4    | 91.9M/170M [00:01&lt;00:01, 53.4MB/s]
- 57%|#####7    | 97.1M/170M [00:02&lt;00:01, 52.4MB/s]
- 60%|######    | 102M/170M [00:02&lt;00:01, 53.6MB/s]
- 63%|######3   | 108M/170M [00:02&lt;00:01, 48.3MB/s]
- 66%|######6   | 112M/170M [00:02&lt;00:01, 47.1MB/s]
- 69%|######8   | 117M/170M [00:02&lt;00:01, 47.6MB/s]
- 72%|#######1  | 122M/170M [00:02&lt;00:01, 48.8MB/s]
- 76%|#######5  | 128M/170M [00:02&lt;00:00, 53.6MB/s]
- 79%|#######8  | 134M/170M [00:02&lt;00:00, 52.3MB/s]
- 82%|########1 | 139M/170M [00:02&lt;00:00, 52.5MB/s]
- 85%|########4 | 144M/170M [00:02&lt;00:00, 52.6MB/s]
- 88%|########7 | 149M/170M [00:03&lt;00:00, 54.5MB/s]
- 91%|#########1| 155M/170M [00:03&lt;00:00, 51.6MB/s]
- 94%|#########3| 160M/170M [00:03&lt;00:00, 51.6MB/s]
- 97%|#########6| 165M/170M [00:03&lt;00:00, 47.0MB/s]
-100%|#########9| 169M/170M [00:03&lt;00:00, 44.8MB/s]
-100%|##########| 170M/170M [00:03&lt;00:00, 49.9MB/s]
+ 10%|9         | 16.3M/170M [00:00&lt;00:00, 171MB/s]
+ 24%|##3       | 40.4M/170M [00:00&lt;00:00, 219MB/s]
+ 38%|###7      | 64.4M/170M [00:00&lt;00:00, 234MB/s]
+ 52%|#####2    | 88.4M/170M [00:00&lt;00:00, 241MB/s]
+ 66%|######6   | 113M/170M [00:00&lt;00:00, 245MB/s]
+ 80%|########  | 136M/170M [00:00&lt;00:00, 246MB/s]
+ 94%|#########4| 160M/170M [00:00&lt;00:00, 247MB/s]
+100%|##########| 170M/170M [00:00&lt;00:00, 240MB/s]
 /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
   for i in range(dim)
 /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the &#39;trunc&#39; function NOT &#39;floor&#39;). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode=&#39;trunc&#39;), or for actual floor division, use torch.div(a, b, rounding_mode=&#39;floor&#39;).
@@ -535,7 +509,7 @@ torchvision rcnn models.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Get 9 valid boxes
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 3 minutes  8.653 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  59.000 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-object-detection-pytorch-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_object_detection_pytorch.py</span></code></a></p>
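
Apart from the faster checkpoint download, the run is unchanged. For orientation, tracing a torchvision detector for Relay import looks roughly like this sketch; the wrapper mirrors the tutorial's trick of returning plain tensors (the detector's dict outputs do not trace cleanly), while the names and the 300-pixel input size are illustrative:

    import torch
    import torchvision
    from tvm import relay

    class TraceWrapper(torch.nn.Module):
        # Unpack the detector's dict output so torch.jit.trace sees tensors.
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, inp):
            out = self.model(inp)
            return out[0]["boxes"], out[0]["scores"], out[0]["masks"]

    model = TraceWrapper(
        torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)).eval()
    inp = torch.rand(1, 3, 300, 300)
    with torch.no_grad():
        script_module = torch.jit.trace(model, inp)

    mod, params = relay.frontend.from_pytorch(script_module, [("input", inp.shape)])
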
diff --git a/docs/how_to/deploy_models/deploy_prequantized.html b/docs/how_to/deploy_models/deploy_prequantized.html
index 3636a46f0..f38d86e2b 100644
--- a/docs/how_to/deploy_models/deploy_prequantized.html
+++ b/docs/how_to/deploy_models/deploy_prequantized.html
@@ -450,7 +450,9 @@ training. Other models require a full post training calibration.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/mobilenet_v2-b0353104.pth&quot; to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
 
   0%|          | 0.00/13.6M [00:00&lt;?, ?B/s]
-100%|##########| 13.6M/13.6M [00:00&lt;00:00, 145MB/s]
+ 28%|##8       | 3.83M/13.6M [00:00&lt;00:00, 40.1MB/s]
+ 59%|#####9    | 8.06M/13.6M [00:00&lt;00:00, 42.6MB/s]
+100%|##########| 13.6M/13.6M [00:00&lt;00:00, 63.0MB/s]
 </pre></div>
 </div>
 </div>
@@ -539,7 +541,7 @@ output values are identical out of 1000 outputs from mobilenet v2.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  90.2464      90.1483      91.1529      90.0241       0.2455
+  90.0879      90.0210      90.8503      89.8837       0.1930
 </pre></div>
 </div>
 <div class="admonition note">
@@ -578,7 +580,7 @@ This includes support for the VNNI 8 bit dot product instruction (CascadeLake or
 <div class="section" id="deploy-a-quantized-tflite-model">
 <h2>Deploy a quantized TFLite Model<a class="headerlink" href="#deploy-a-quantized-tflite-model" title="Permalink to this headline">¶</a></h2>
 <p>TODO</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  6.301 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  3.828 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized.py</span></code></a></p>
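
The page this hunk touches quantizes mobilenet_v2 in PyTorch before handing it to TVM. A condensed sketch of that eager-mode, post-training flow, assuming the public torch.quantization and torchvision quantizable-model APIs; calibration is reduced to a single random batch here, whereas a real run needs representative data:

    import torch
    from torchvision.models.quantization import mobilenet_v2
    from tvm import relay

    qmodel = mobilenet_v2(pretrained=True).eval()
    inp = torch.rand(1, 3, 224, 224)

    # Fuse conv+bn+relu, attach a quantization config, calibrate, convert.
    qmodel.fuse_model()
    qmodel.qconfig = torch.quantization.get_default_qconfig("fbgemm")
    torch.quantization.prepare(qmodel, inplace=True)
    qmodel(inp)  # calibration pass
    torch.quantization.convert(qmodel, inplace=True)

    script_module = torch.jit.trace(qmodel, inp).eval()
    mod, params = relay.frontend.from_pytorch(script_module, [("input", inp.shape)])
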
diff --git a/docs/how_to/deploy_models/deploy_prequantized_tflite.html b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
index 8f805efc7..53066f4c1 100644
--- a/docs/how_to/deploy_models/deploy_prequantized_tflite.html
+++ b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
@@ -540,7 +540,7 @@ TFLite Top-5 labels: [387 102 386 341 349]
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  119.9655     119.9535     121.8199     119.1104      0.3561
+  120.2053     120.0332     129.6409     119.2245      1.0663
 </pre></div>
 </div>
 <div class="admonition note">
@@ -568,7 +568,7 @@ network for ARM CPU</span></a>.</p></li>
 </ul>
 </div></blockquote>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  52.691 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  51.148 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-tflite-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized_tflite.py</span></code></a></p>
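
For the TFLite counterpart, the import boils down to parsing the flatbuffer and calling the Relay frontend; a minimal sketch assuming the `tflite` Python package and the quantized MobileNet file this tutorial downloads:

    import tflite
    from tvm import relay

    with open("mobilenet_v2_1.0_224_quant.tflite", "rb") as f:
        tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

    # Quantized TFLite graphs take uint8 input.
    mod, params = relay.frontend.from_tflite(
        tflite_model,
        shape_dict={"input": (1, 224, 224, 3)},
        dtype_dict={"input": "uint8"},
    )
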
diff --git a/docs/how_to/deploy_models/deploy_quantized.html b/docs/how_to/deploy_models/deploy_quantized.html
index 726db1b4f..2ccf53c4b 100644
--- a/docs/how_to/deploy_models/deploy_quantized.html
+++ b/docs/how_to/deploy_models/deploy_quantized.html
@@ -480,7 +480,7 @@ for calibration. But the accuracy might be impacted.</p>
   DeprecationWarning,
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  15.594 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  11.439 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-quantized-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_quantized.py</span></code></a></p>
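
This page quantizes inside TVM rather than in the framework. The core step is a few lines, sketched here under the assumption that `mod` and `params` hold a float32 Relay model imported earlier in the tutorial:

    from tvm import relay

    # Global-scale calibration, as the tutorial's own warning notes, is fast
    # but can cost accuracy compared to dataset-based calibration.
    with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
        mod = relay.quantize.quantize(mod, params)
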
diff --git a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
index 632116a3f..d3d974e96 100644
--- a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
+++ b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
@@ -415,24 +415,25 @@ to your device.</p>
 Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
 
   0%|          | 0/132723 [00:00&lt;?, ?KB/s]
-  2%|2         | 2805/132723 [00:00&lt;00:04, 28046.95KB/s]
-  6%|6         | 8209/132723 [00:00&lt;00:02, 43333.28KB/s]
- 12%|#2        | 16549/132723 [00:00&lt;00:01, 61625.06KB/s]
- 19%|#8        | 25068/132723 [00:00&lt;00:01, 70924.89KB/s]
- 25%|##5       | 33569/132723 [00:00&lt;00:01, 76000.78KB/s]
- 32%|###1      | 42087/132723 [00:00&lt;00:01, 79119.22KB/s]
- 38%|###8      | 50584/132723 [00:00&lt;00:01, 81029.42KB/s]
- 45%|####4     | 59145/132723 [00:00&lt;00:00, 82476.70KB/s]
- 51%|#####     | 67594/132723 [00:00&lt;00:00, 83102.80KB/s]
- 57%|#####7    | 76168/132723 [00:01&lt;00:00, 83915.47KB/s]
- 64%|######3   | 84560/132723 [00:01&lt;00:00, 71882.90KB/s]
- 69%|######9   | 92044/132723 [00:01&lt;00:00, 71191.74KB/s]
- 75%|#######4  | 99365/132723 [00:01&lt;00:00, 71733.12KB/s]
- 80%|########  | 106684/132723 [00:01&lt;00:00, 58403.00KB/s]
- 86%|########5 | 114001/132723 [00:01&lt;00:00, 62039.63KB/s]
- 91%|######### | 120618/132723 [00:01&lt;00:00, 48370.80KB/s]
- 97%|#########7| 129142/132723 [00:01&lt;00:00, 56547.44KB/s]
-100%|##########| 132723/132723 [00:02&lt;00:00, 66173.10KB/s]
+  3%|3         | 4144/132723 [00:00&lt;00:03, 41437.27KB/s]
+  9%|8         | 11290/132723 [00:00&lt;00:02, 59093.28KB/s]
+ 14%|#4        | 18751/132723 [00:00&lt;00:01, 66174.53KB/s]
+ 20%|##        | 26640/132723 [00:00&lt;00:01, 71191.75KB/s]
+ 25%|##5       | 33760/132723 [00:00&lt;00:01, 69404.79KB/s]
+ 31%|###1      | 41725/132723 [00:00&lt;00:01, 72826.56KB/s]
+ 38%|###7      | 49800/132723 [00:00&lt;00:01, 75383.68KB/s]
+ 43%|####3     | 57680/132723 [00:00&lt;00:00, 76457.72KB/s]
+ 49%|####9     | 65334/132723 [00:00&lt;00:00, 76466.72KB/s]
+ 55%|#####4    | 72987/132723 [00:01&lt;00:00, 76404.53KB/s]
+ 61%|######    | 80632/132723 [00:01&lt;00:00, 76351.90KB/s]
+ 67%|######6   | 88270/132723 [00:01&lt;00:00, 66500.77KB/s]
+ 72%|#######1  | 95145/132723 [00:01&lt;00:00, 65937.94KB/s]
+ 77%|#######6  | 101893/132723 [00:01&lt;00:00, 53808.17KB/s]
+ 83%|########2 | 109644/132723 [00:01&lt;00:00, 59575.68KB/s]
+ 88%|########7 | 116327/132723 [00:01&lt;00:00, 61443.14KB/s]
+ 94%|#########3| 124123/132723 [00:01&lt;00:00, 65889.42KB/s]
+ 99%|#########9| 131929/132723 [00:01&lt;00:00, 69268.25KB/s]
+100%|##########| 132723/132723 [00:01&lt;00:00, 67766.51KB/s]
 </pre></div>
 </div>
 <p>Create TVM runtime and do inference
@@ -472,7 +473,7 @@ Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from h
 </pre></div>
 </div>
 <img alt="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" class="sphx-glr-single-img" src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" />
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  25.317 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  20.659 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-ssd-gluoncv-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_ssd_gluoncv.py</span></code></a></p>
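
The zip fetched above is the GluonCV SSD checkpoint. Getting from that checkpoint to a Relay module is short; a sketch assuming the public gluoncv model zoo and the 512x512 input this tutorial uses:

    from gluoncv import model_zoo
    from tvm import relay

    block = model_zoo.get_model("ssd_512_resnet50_v1_voc", pretrained=True)
    mod, params = relay.frontend.from_mxnet(block, {"data": (1, 3, 512, 512)})
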
diff --git a/docs/how_to/deploy_models/sg_execution_times.html b/docs/how_to/deploy_models/sg_execution_times.html
index 4a3cb3808..0918dc365 100644
--- a/docs/how_to/deploy_models/sg_execution_times.html
+++ b/docs/how_to/deploy_models/sg_execution_times.html
@@ -300,16 +300,16 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-deploy-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>10:39.563</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
+<p><strong>10:14.652</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
 <ul class="simple">
-<li><p><strong>03:08.653</strong>: <a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-how-to-deploy-models-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></li>
-<li><p><strong>02:25.317</strong>: <a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-how-to-deploy-models-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></li>
-<li><p><strong>01:52.691</strong>: <a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-how-to-deploy-models-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></li>
-<li><p><strong>01:15.594</strong>: <a class="reference internal" href="deploy_quantized.html#sphx-glr-how-to-deploy-models-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></li>
-<li><p><strong>01:06.301</strong>: <a class="reference internal" href="deploy_prequantized.html#sphx-glr-how-to-deploy-models-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></li>
-<li><p><strong>00:29.042</strong>: <a class="reference internal" href="deploy_model_on_android.html#sphx-glr-how-to-deploy-models-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></li>
-<li><p><strong>00:21.770</strong>: <a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-how-to-deploy-models-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></li>
-<li><p><strong>00:00.193</strong>: <a class="reference internal" href="deploy_sparse.html#sphx-glr-how-to-deploy-models-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></li>
+<li><p><strong>02:59.000</strong>: <a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-how-to-deploy-models-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></li>
+<li><p><strong>02:20.659</strong>: <a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-how-to-deploy-models-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></li>
+<li><p><strong>01:51.148</strong>: <a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-how-to-deploy-models-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></li>
+<li><p><strong>01:11.439</strong>: <a class="reference internal" href="deploy_quantized.html#sphx-glr-how-to-deploy-models-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></li>
+<li><p><strong>01:03.828</strong>: <a class="reference internal" href="deploy_prequantized.html#sphx-glr-how-to-deploy-models-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></li>
+<li><p><strong>00:27.257</strong>: <a class="reference internal" href="deploy_model_on_android.html#sphx-glr-how-to-deploy-models-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></li>
+<li><p><strong>00:21.141</strong>: <a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-how-to-deploy-models-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></li>
+<li><p><strong>00:00.181</strong>: <a class="reference internal" href="deploy_sparse.html#sphx-glr-how-to-deploy-models-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/extend_tvm/bring_your_own_datatypes.html b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
index 9e6659570..a1a6b63aa 100644
--- a/docs/how_to/extend_tvm/bring_your_own_datatypes.html
+++ b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
@@ -588,7 +588,7 @@ In this alpha state of the Bring Your Own Datatypes framework, we have not imple
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zipbb3576fa-0825-4ae2-a4d0-31959dcffe09 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip7b13c604-a211-40c8-8a74-68a208b9f928 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 </pre></div>
 </div>
 <p>It’s easy to execute MobileNet with native TVM:</p>
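
As a reminder of what "execute with native TVM" means at this point in the page, a minimal sketch assuming `module` and `params` hold the MobileNet just downloaded, with a random array standing in for the input image:

    import numpy as np
    from tvm import relay

    ex = relay.create_executor("graph", mod=module)
    input_data = np.random.rand(1, 3, 224, 224).astype("float32")  # stand-in image
    result = ex.evaluate()(input_data, **params).numpy()
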
diff --git a/docs/how_to/extend_tvm/sg_execution_times.html b/docs/how_to/extend_tvm/sg_execution_times.html
index 632db5061..e4c39487d 100644
--- a/docs/how_to/extend_tvm/sg_execution_times.html
+++ b/docs/how_to/extend_tvm/sg_execution_times.html
@@ -300,12 +300,12 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-extend-tvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:38.436</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
+<p><strong>00:37.937</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:34.919</strong>: <a class="reference internal" href="bring_your_own_datatypes.html#sphx-glr-how-to-extend-tvm-bring-your-own-datatypes-py"><span class="std std-ref">Bring Your Own Datatypes to TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">bring_your_own_datatypes.py</span></code>)</p></li>
-<li><p><strong>00:02.273</strong>: <a class="reference internal" href="use_pass_instrument.html#sphx-glr-how-to-extend-tvm-use-pass-instrument-py"><span class="std std-ref">How to Use TVM Pass Instrument</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_instrument.py</span></code>)</p></li>
-<li><p><strong>00:01.051</strong>: <a class="reference internal" href="use_pass_infra.html#sphx-glr-how-to-extend-tvm-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></li>
-<li><p><strong>00:00.194</strong>: <a class="reference internal" href="low_level_custom_pass.html#sphx-glr-how-to-extend-tvm-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></li>
+<li><p><strong>00:34.488</strong>: <a class="reference internal" href="bring_your_own_datatypes.html#sphx-glr-how-to-extend-tvm-bring-your-own-datatypes-py"><span class="std std-ref">Bring Your Own Datatypes to TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">bring_your_own_datatypes.py</span></code>)</p></li>
+<li><p><strong>00:02.219</strong>: <a class="reference internal" href="use_pass_instrument.html#sphx-glr-how-to-extend-tvm-use-pass-instrument-py"><span class="std std-ref">How to Use TVM Pass Instrument</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_instrument.py</span></code>)</p></li>
+<li><p><strong>00:01.044</strong>: <a class="reference internal" href="use_pass_infra.html#sphx-glr-how-to-extend-tvm-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></li>
+<li><p><strong>00:00.186</strong>: <a class="reference internal" href="low_level_custom_pass.html#sphx-glr-how-to-extend-tvm-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/extend_tvm/use_pass_instrument.html b/docs/how_to/extend_tvm/use_pass_instrument.html
index 28b572d00..ee8794e91 100644
--- a/docs/how_to/extend_tvm/use_pass_instrument.html
+++ b/docs/how_to/extend_tvm/use_pass_instrument.html
@@ -486,10 +486,10 @@ profile the execution time of each passes.</p>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 6061us [6061us] (45.57%; 45.57%)
-FoldScaleAxis: 7240us [2us] (54.43%; 54.43%)
-        FoldConstant: 7238us [1483us] (54.42%; 99.97%)
-                InferType: 5756us [5756us] (43.27%; 79.52%)
+InferType: 6264us [6264us] (46.04%; 46.04%)
+FoldScaleAxis: 7341us [3us] (53.96%; 53.96%)
+        FoldConstant: 7338us [1498us] (53.94%; 99.96%)
+                InferType: 5840us [5840us] (42.93%; 79.59%)
 </pre></div>
 </div>
 </div>
@@ -512,10 +512,10 @@ Refer to following sections and <a class="reference internal" href="../../refere
 </div>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 5825us [5825us] (44.84%; 44.84%)
-FoldScaleAxis: 7165us [2us] (55.16%; 55.16%)
-        FoldConstant: 7163us [1493us] (55.14%; 99.97%)
-                InferType: 5671us [5671us] (43.65%; 79.16%)
+InferType: 5868us [5868us] (44.66%; 44.66%)
+FoldScaleAxis: 7271us [2us] (55.34%; 55.34%)
+        FoldConstant: 7269us [1511us] (55.32%; 99.97%)
+                InferType: 5758us [5758us] (43.83%; 79.22%)
 </pre></div>
 </div>
 <p>Register empty list to clear existing instruments.</p>
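
Both timing profiles above come from the same mechanism. A minimal sketch of collecting one, assuming `relay_mod` is any Relay module (e.g. from tvm.relay.testing); note that render() has to be called while the pass context is still active:

    import tvm
    from tvm import relay
    from tvm.ir.instrument import PassTimingInstrument

    timing_inst = PassTimingInstrument()
    with tvm.transform.PassContext(instruments=[timing_inst]):
        relay_mod = relay.transform.InferType()(relay_mod)
        relay_mod = relay.transform.FoldScaleAxis()(relay_mod)
        profiles = timing_inst.render()  # must happen inside the context
    print(profiles)
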
diff --git a/docs/how_to/optimize_operators/opt_conv_cuda.html b/docs/how_to/optimize_operators/opt_conv_cuda.html
index ca68e63d2..d10121b03 100644
--- a/docs/how_to/optimize_operators/opt_conv_cuda.html
+++ b/docs/how_to/optimize_operators/opt_conv_cuda.html
@@ -534,7 +534,7 @@ latency of convolution.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 51.352703 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 54.117038 ms
 </pre></div>
 </div>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-optimize-operators-opt-conv-cuda-py">
diff --git a/docs/how_to/optimize_operators/opt_conv_tensorcore.html b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
index 4dbf779d6..319379557 100644
--- a/docs/how_to/optimize_operators/opt_conv_tensorcore.html
+++ b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
@@ -876,7 +876,7 @@ be able to run on our build server</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 6.634472 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 6.532508 ms
 </pre></div>
 </div>
 </div>
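
The TensorCore variant only runs where the hardware supports it; the tutorial guards on compute capability, roughly as in this sketch:

    import tvm
    from tvm.contrib import nvcc

    dev = tvm.cuda(0)
    if not nvcc.have_tensorcore(dev.compute_version):
        print("Skipping: no TensorCores on compute", dev.compute_version)
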
diff --git a/docs/how_to/optimize_operators/opt_gemm.html b/docs/how_to/optimize_operators/opt_gemm.html
index 5b3beb8d2..6d80eaa96 100644
--- a/docs/how_to/optimize_operators/opt_gemm.html
+++ b/docs/how_to/optimize_operators/opt_gemm.html
@@ -431,8 +431,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.019783
-Baseline: 3.546348
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018150
+Baseline: 3.239384
 </pre></div>
 </div>
 <p>In TVM, we can always inspect lower level IR to debug or optimize our schedule.
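
The baseline being timed above is the naive triple loop. Since the rest of this file's hunks track one schedule through successive optimizations, here is the shared starting point, a sketch matching the tutorial's setup (a 1024-cubed float32 GEMM built for LLVM):

    import tvm
    from tvm import te

    M = K = N = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

    s = te.create_schedule(C.op)
    func = tvm.build(s, [A, B, C], target="llvm", name="mmult")
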
@@ -493,7 +493,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.322029
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.297664
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
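
Opt1's roughly tenfold win over the baseline comes from blocking. Continuing the sketch above (each Opt step in the tutorial rebuilds `s` from scratch; only the distinctive lines are shown here):

    bn = 32
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=4)
    s[C].reorder(mo, no, ko, ki, mi, ni)  # keep a 32x32 block of C hot in cache
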
@@ -561,7 +561,7 @@ vastly.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.348650
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.338107
 </pre></div>
 </div>
 <p>Here is the generated IR after vectorization.</p>
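
Opt2 adds vectorization on top of the blocked schedule; in the sketch it is one line:

    s[C].vectorize(ni)  # the innermost row of the block maps to SIMD lanes
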
@@ -623,7 +623,7 @@ the access pattern for A matrix is more cache friendly.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.118034
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.115493
 </pre></div>
 </div>
 <p>Here is the generated IR after loop permutation.</p>
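
Opt3's threefold drop (about 0.34 s to about 0.12 s) is the loop permutation the surrounding text describes; in the sketch, `mi` is hoisted above the inner reduction `ki` so A is walked sequentially along k:

    s[C].reorder(mo, no, ko, mi, ki, ni)
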
@@ -707,7 +707,7 @@ flattening.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.111703
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.110447
 </pre></div>
 </div>
 <p>Here is the generated IR after array packing.</p>
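
Opt4 repacks B so its innermost dimension is contiguous in memory; sketched with the same names as above (tvm.tir.indexmod keeps the index arithmetic in TIR):

    packedB = te.compute(
        (N // bn, K, bn),
        lambda bigN, k, littleN: B[k, bigN * bn + littleN],
        name="packedB",
    )
    C = te.compute(
        (M, N),
        lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
        name="C",
    )
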
@@ -794,7 +794,7 @@ write to C when all the block results are ready.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.111723
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.111191
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
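
Opt5 introduces the write cache the text mentions, so partial sums accumulate in a sequential local buffer and C is written once per block; the core of it in the sketch:

    CC = s.cache_write(C, "global")
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[CC].compute_at(s[C], no)  # fill the cached block inside the (mo, no) tile
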
@@ -885,7 +885,7 @@ write to C when all the block results are ready.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.144970
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.143822
 </pre></div>
 </div>
 <p>Here is the generated IR after parallelization.</p>
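
Opt6 adds thread-level parallelism over the outermost block loop; in the sketch:

    s[C].parallel(mo)
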
diff --git a/docs/how_to/optimize_operators/sg_execution_times.html b/docs/how_to/optimize_operators/sg_execution_times.html
index be1061e69..035c1198c 100644
--- a/docs/how_to/optimize_operators/sg_execution_times.html
+++ b/docs/how_to/optimize_operators/sg_execution_times.html
@@ -300,11 +300,11 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-optimize-operators-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:35.915</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
+<p><strong>00:34.280</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:33.326</strong>: <a class="reference internal" href="opt_gemm.html#sphx-glr-how-to-optimize-operators-opt-gemm-py"><span class="std std-ref">How to optimize GEMM on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_gemm.py</span></code>)</p></li>
-<li><p><strong>00:01.370</strong>: <a class="reference internal" href="opt_conv_tensorcore.html#sphx-glr-how-to-optimize-operators-opt-conv-tensorcore-py"><span class="std std-ref">How to optimize convolution using TensorCores</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_tensorcore.py</span></code>)</p></li>
-<li><p><strong>00:01.219</strong>: <a class="reference internal" href="opt_conv_cuda.html#sphx-glr-how-to-optimize-operators-opt-conv-cuda-py"><span class="std std-ref">How to optimize convolution on GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_cuda.py</span></code>)</p></li>
+<li><p><strong>00:31.695</strong>: <a class="reference internal" href="opt_gemm.html#sphx-glr-how-to-optimize-operators-opt-gemm-py"><span class="std std-ref">How to optimize GEMM on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_gemm.py</span></code>)</p></li>
+<li><p><strong>00:01.371</strong>: <a class="reference internal" href="opt_conv_tensorcore.html#sphx-glr-how-to-optimize-operators-opt-conv-tensorcore-py"><span class="std std-ref">How to optimize convolution using TensorCores</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_tensorcore.py</span></code>)</p></li>
+<li><p><strong>00:01.214</strong>: <a class="reference internal" href="opt_conv_cuda.html#sphx-glr-how-to-optimize-operators-opt-conv-cuda-py"><span class="std std-ref">How to optimize convolution on GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_cuda.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
index 2dcbbe429..4e397ac6c 100644
--- a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
+++ b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
@@ -300,14 +300,14 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-tune-with-autoscheduler-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>04:55.287</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
+<p><strong>04:53.728</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
 <ul class="simple">
-<li><p><strong>02:19.285</strong>: <a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a Convolution Layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></li>
-<li><p><strong>01:20.977</strong>: <a class="reference internal" href="tune_network_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-x86-py"><span class="std std-ref">Auto-scheduling a Neural Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_x86.py</span></code>)</p></li>
-<li><p><strong>00:40.523</strong>: <a class="reference internal" href="tune_network_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-cuda-py"><span class="std std-ref">Auto-scheduling a Neural Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_cuda.py</span></code>)</p></li>
-<li><p><strong>00:16.676</strong>: <a class="reference internal" href="tune_sparse_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-sparse-x86-py"><span class="std std-ref">Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_sparse_x86.py</span></code>)</p></li>
-<li><p><strong>00:08.992</strong>: <a class="reference internal" href="tune_network_mali.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-mali-py"><span class="std std-ref">Auto-scheduling a Neural Network for mali GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_mali.py</span></code>)</p></li>
-<li><p><strong>00:08.835</strong>: <a class="reference internal" href="tune_network_arm.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-arm-py"><span class="std std-ref">Auto-scheduling a Neural Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_arm.py</span></code>)</p></li>
+<li><p><strong>02:20.981</strong>: <a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a Convolution Layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></li>
+<li><p><strong>01:19.068</strong>: <a class="reference internal" href="tune_network_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-x86-py"><span class="std std-ref">Auto-scheduling a Neural Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_x86.py</span></code>)</p></li>
+<li><p><strong>00:40.131</strong>: <a class="reference internal" href="tune_network_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-cuda-py"><span class="std std-ref">Auto-scheduling a Neural Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_cuda.py</span></code>)</p></li>
+<li><p><strong>00:16.628</strong>: <a class="reference internal" href="tune_sparse_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-sparse-x86-py"><span class="std std-ref">Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_sparse_x86.py</span></code>)</p></li>
+<li><p><strong>00:08.632</strong>: <a class="reference internal" href="tune_network_mali.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-mali-py"><span class="std std-ref">Auto-scheduling a Neural Network for mali GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_mali.py</span></code>)</p></li>
+<li><p><strong>00:08.287</strong>: <a class="reference internal" href="tune_network_arm.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-arm-py"><span class="std std-ref">Auto-scheduling a Neural Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_arm.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
index 2b62061f9..8f0882e9e 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
@@ -470,110 +470,70 @@ cooperative fetching, unrolling and operator fusion.</p>
              compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
   buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
   attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 16;
-  allocate(conv2d_nchw: Pointer(local float32), float32, [14]), storage_scope = local;
-  allocate(pad_temp.shared: Pointer(shared float32), float32, [504]), storage_scope = shared;
-  allocate(kernel.shared: Pointer(shared float32), float32, [768]), storage_scope = shared;
-  attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112 {
-    conv2d_nchw_1: Buffer(conv2d_nchw, float32, [14], [], scope=&quot;local&quot;, align=32)[0] = 0f32
+  allocate(conv2d_nchw: Pointer(local float32), float32, [7]), storage_scope = local;
+  allocate(pad_temp.shared: Pointer(shared float32), float32, [1008]), storage_scope = shared;
+  allocate(kernel.shared: Pointer(shared float32), float32, [1536]), storage_scope = shared;
+  attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224 {
+    conv2d_nchw_1: Buffer(conv2d_nchw, float32, [7], [], scope=&quot;local&quot;, align=16)[0] = 0f32
     conv2d_nchw_1[1] = 0f32
     conv2d_nchw_1[2] = 0f32
     conv2d_nchw_1[3] = 0f32
     conv2d_nchw_1[4] = 0f32
     conv2d_nchw_1[5] = 0f32
     conv2d_nchw_1[6] = 0f32
-    conv2d_nchw_1[7] = 0f32
-    conv2d_nchw_1[8] = 0f32
-    conv2d_nchw_1[9] = 0f32
-    conv2d_nchw_1[10] = 0f32
-    conv2d_nchw_1[11] = 0f32
-    conv2d_nchw_1[12] = 0f32
-    conv2d_nchw_1[13] = 0f32
-    for (rc.outer.outer: int32, 0, 64) {
-      for (ry.outer.outer: int32, 0, 3) {
-        let cse_var_4: int32 = (rc.outer.outer*392)
-        let cse_var_3: int32 = (ry.outer.outer*7)
-        let cse_var_2: int32 = (rc.outer.outer*72)
-        let cse_var_1: int32 = (ry.outer.outer*3)
+    for (rc.outer.outer: int32, 0, 32) {
+      for (rx.outer.outer: int32, 0, 3) {
+        let cse_var_2: int32 = (rc.outer.outer*144)
+        let cse_var_1: int32 = (rc.outer.outer*784)
          {
-          attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          pad_temp.shared_1: Buffer(pad_temp.shared, float32, [504], [], scope=&quot;shared&quot;)[threadIdx.x_1] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(threadIdx.x_1, 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) - 8)], 0f3 [...]
-          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          pad_temp.shared_1[(threadIdx.x_1 + 112)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 112), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 112), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 4), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 4), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 112), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          pad_temp.shared_1[(threadIdx.x_1 + 224)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 224), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 224), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 8), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 8), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 224), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          pad_temp.shared_1[(threadIdx.x_1 + 336)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 336), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 336), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 3), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 3), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 336), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          if @tir.likely((threadIdx.x_1 &lt; 56), dtype=bool) {
-            pad_temp.shared_1[(threadIdx.x_1 + 448)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 448), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 448), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 7), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 7), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 448), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224 {
+            pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1008], [], scope=&quot;shared&quot;)[(threadIdx.x_1*2)] = @tir.if_then_else(((((7 &lt;= floormod((threadIdx.x_1*2), 63)) &amp;&amp; (floormod((threadIdx.x_1*2), 63) &lt; 56)) &amp;&amp; (1 &lt;= (rx.outer.outer + floormod((threadIdx.x_1*2), 7)))) &amp;&amp; ((rx.outer.outer + floormod((threadIdx.x_1*2), 7)) &lt; 8)), data[((((cse_var_1 + (floordiv((threadIdx.x_1*2), 63)*49)) + rx.outer.outer) + floormod((threadIdx.x_1*2), 6 [...]
+            pad_temp.shared_1[((threadIdx.x_1*2) + 1)] = @tir.if_then_else(((((7 &lt;= floormod(((threadIdx.x_1*2) + 1), 63)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 63) &lt; 56)) &amp;&amp; (1 &lt;= (rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)))) &amp;&amp; ((rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)) &lt; 8)), data[((((cse_var_1 + (floordiv(((threadIdx.x_1*2) + 1), 63)*49)) + rx.outer.outer) + floormod(((threadIdx.x_1*2) + 1), 63)) - 8)], 0f32, dtype=float32)
           }
-          attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          kernel.shared_1: Buffer(kernel.shared, float32, [768], [], scope=&quot;shared&quot;)[threadIdx.x_2] = kernel[((((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 24)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
-          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          kernel.shared_1[(threadIdx.x_2 + 112)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 14), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 112), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
-          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          kernel.shared_1[(threadIdx.x_2 + 224)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 28), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 224), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
-          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          kernel.shared_1[(threadIdx.x_2 + 336)] = kernel[(((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 64512)]
-          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 56), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 448), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
-          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          kernel.shared_1[(threadIdx.x_2 + 560)] = kernel[((((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 8) + 70), 3)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 560), 24), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
-          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-          if @tir.likely((threadIdx.x_2 &lt; 96), dtype=bool) {
-            kernel.shared_1[(threadIdx.x_2 + 672)] = kernel[(((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 8), 3)*4608)) + cse_var_2) + (floordiv(floormod(threadIdx.x_2, 24), 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 129024)]
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224 {
+            pad_temp.shared_1[(((floordiv((floordiv((threadIdx.x_1*2), 7) + 64), 9)*63) + (floormod((floordiv((threadIdx.x_1*2), 7) + 1), 9)*7)) + floormod((threadIdx.x_1*2), 7))] = @tir.if_then_else(((((1 &lt;= floormod((floordiv((threadIdx.x_1*2), 7) + 1), 9)) &amp;&amp; (floormod((floordiv((threadIdx.x_1*2), 7) + 1), 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx.outer.outer + floormod((threadIdx.x_1*2), 7)))) &amp;&amp; ((rx.outer.outer + floormod((threadIdx.x_1*2), 7)) &lt; 8)), data[(((((cse_ [...]
+            pad_temp.shared_1[(((floordiv((floordiv(((threadIdx.x_1*2) + 1), 7) + 64), 9)*63) + (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 1), 9)*7)) + floormod(((threadIdx.x_1*2) + 1), 7))] = @tir.if_then_else(((((1 &lt;= floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 1), 9)) &amp;&amp; (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 1), 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)))) &amp;&amp; ((rx.outer.outer + floormod(((threadIdx [...]
           }
-          for (rc.outer.inner: int32, 0, 8) {
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9))]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3))]))
-            conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 24)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 1)]))
-            conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 25)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 8)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 2)]))
-            conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-            conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 3)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-            conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 4)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-            conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 5)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-            conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 6)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-            conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
-            conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + (floormod(threadIdx.x, 7)*9)) + 8)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + 26)]))
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224 {
+            if @tir.likely((threadIdx.x_1 &lt; 56), dtype=bool) {
+              pad_temp.shared_1[(((floordiv((floordiv((threadIdx.x_1*2), 7) + 128), 9)*63) + (floormod((floordiv((threadIdx.x_1*2), 7) + 2), 9)*7)) + floormod((threadIdx.x_1*2), 7))] = @tir.if_then_else(((((1 &lt;= floormod((floordiv((threadIdx.x_1*2), 7) + 2), 9)) &amp;&amp; (floormod((floordiv((threadIdx.x_1*2), 7) + 2), 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx.outer.outer + floormod((threadIdx.x_1*2), 7)))) &amp;&amp; ((rx.outer.outer + floormod((threadIdx.x_1*2), 7)) &lt; 8)), data[(((((c [...]
+            }
+            if @tir.likely((threadIdx.x_1 &lt; 56), dtype=bool) {
+              pad_temp.shared_1[(((floordiv((floordiv(((threadIdx.x_1*2) + 1), 7) + 128), 9)*63) + (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 2), 9)*7)) + floormod(((threadIdx.x_1*2) + 1), 7))] = @tir.if_then_else(((((1 &lt;= floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 2), 9)) &amp;&amp; (floormod((floordiv(((threadIdx.x_1*2) + 1), 7) + 2), 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx.outer.outer + floormod(((threadIdx.x_1*2) + 1), 7)))) &amp;&amp; ((rx.outer.outer + floormod(((thread [...]
+            }
+          }
+          attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          kernel.shared_1: Buffer(kernel.shared, float32, [1536], [], scope=&quot;shared&quot;)[threadIdx.x_2] = kernel[(((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 48)*4608)) + cse_var_2) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          kernel.shared_1[(threadIdx.x_2 + 224)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 14), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 32), 48)*3)) + rx.outer.outer)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 28), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 16), 48)*3)) + rx.outer.outer)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          kernel.shared_1[(threadIdx.x_2 + 672)] = kernel[((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 16), 3)*4608)) + cse_var_2) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer) + 64512)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          kernel.shared_1[(threadIdx.x_2 + 896)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 56), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 32), 48)*3)) + rx.outer.outer)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          kernel.shared_1[(threadIdx.x_2 + 1120)] = kernel[(((((blockIdx.x*147456) + (floordiv((floordiv(threadIdx.x_2, 16) + 70), 3)*4608)) + cse_var_2) + (floormod((threadIdx.x_2 + 16), 48)*3)) + rx.outer.outer)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 224;
+          if @tir.likely((threadIdx.x_2 &lt; 192), dtype=bool) {
+            kernel.shared_1[(threadIdx.x_2 + 1344)] = kernel[((((((blockIdx.x*147456) + (floordiv(floordiv(threadIdx.x_2, 16), 3)*4608)) + cse_var_2) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer) + 129024)]
+          }
+          for (rc.outer.inner: int32, 0, 16) {
+            for (ry.outer.inner: int32, 0, 3) {
+              conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+              conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 7)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+              conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 14)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+              conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 21)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+              conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+              conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 35)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+              conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((((rc.outer.inner*63) + (ry.outer.inner*7)) + floormod(threadIdx.x, 7)) + 42)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*48) + (rc.outer.inner*3)) + ry.outer.inner)]))
+            }
           }
         }
       }
     }
-    for (i1.inner: int32, 0, 2) {
-      for (i3.inner: int32, 0, 7) {
-        compute[(((((blockIdx.x*1568) + (floordiv(threadIdx.x, 7)*98)) + (i1.inner*49)) + (floormod(threadIdx.x, 7)*7)) + i3.inner)] = max((conv2d_nchw_1[((i1.inner*7) + i3.inner)] + bias[(((blockIdx.x*32) + (floordiv(threadIdx.x, 7)*2)) + i1.inner)]), 0f32)
-      }
+    for (i2.inner: int32, 0, 7) {
+      compute[((((blockIdx.x*1568) + (floordiv(threadIdx.x, 7)*49)) + (i2.inner*7)) + floormod(threadIdx.x, 7))] = max((conv2d_nchw_1[i2.inner] + bias[((blockIdx.x*32) + floordiv(threadIdx.x, 7))]), 0f32)
     }
   }
 }
@@ -611,7 +571,7 @@ cooperative fetching, unrolling and operator fusion.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.276 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.317 ms
 </pre></div>
 </div>
 </div>
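For reference, a number like the 0.317 ms above is obtained by applying the best tuning record, building the kernel, and timing it on the device. A minimal sketch, assuming task, log_file, a CUDA device dev, and pre-allocated arrays a_tvm, w_tvm, c_tvm (the array names are hypothetical):

    import numpy as np
    import tvm

    # Apply the best record from the log and build the tuned kernel.
    sch, args = task.apply_best(log_file)
    func = tvm.build(sch, args, target="cuda")

    # Time it; min_repeat_ms amortizes kernel-launch overhead.
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    print("Execution time of this operator: %.3f ms"
          % (np.median(evaluator(a_tvm, w_tvm, c_tvm).results) * 1000))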
@@ -642,35 +602,35 @@ conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o
 conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
 conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
 conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
-conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
-conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=16)
+conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
+conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=32)
 conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
-conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
+conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=7)
 conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
-conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
+conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
 conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
 conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
-conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=7)
-conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=1)
+conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
+conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
 conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
 conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=1)
-conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=8)
+conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=16)
 conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
-conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
 conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
-conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
 s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nc [...]
 compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
 compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
 compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
-compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=16)
+compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
+compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=32)
 compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
-compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
-compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
+compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=7)
+compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
 compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
-compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
-compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
+compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
+compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
 compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
 s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
 s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
@@ -690,14 +650,14 @@ s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread
 kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
 kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
 s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
+kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=224)
 s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(&quot;threadIdx.x&quot;))
 pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
-pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
+pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=2)
 s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
+pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=224)
 s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(&quot;threadIdx.x&quot;))
-s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 64)
+s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 16)
 s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;unroll_explicit&quot;, True)
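The equivalent Python schedule above and the CUDA source printed below can both be recovered from the tuning log; a minimal sketch, assuming the same task and log_file as before:

    # Print the best record as a TE python schedule, then as generated CUDA.
    print(task.print_best(log_file, print_mode="schedule"))
    print(task.print_best(log_file, print_mode="cuda"))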
 
 CUDA source code:
@@ -715,10 +675,10 @@ CUDA source code:
   #define int64_t long long
   #define uint64_t unsigned long long
 #endif
-extern &quot;C&quot; __global__ void __launch_bounds__(112) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-  float conv2d_nchw[14];
-  __shared__ float pad_temp_shared[504];
-  __shared__ float kernel_shared[768];
+extern &quot;C&quot; __global__ void __launch_bounds__(224) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+  float conv2d_nchw[7];
+  __shared__ float pad_temp_shared[1008];
+  __shared__ float kernel_shared[1536];
   conv2d_nchw[0] = 0.000000e+00f;
   conv2d_nchw[1] = 0.000000e+00f;
   conv2d_nchw[2] = 0.000000e+00f;
@@ -726,83 +686,44 @@ extern &quot;C&quot; __global__ void __launch_bounds__(112) default_function_ker
   conv2d_nchw[4] = 0.000000e+00f;
   conv2d_nchw[5] = 0.000000e+00f;
   conv2d_nchw[6] = 0.000000e+00f;
-  conv2d_nchw[7] = 0.000000e+00f;
-  conv2d_nchw[8] = 0.000000e+00f;
-  conv2d_nchw[9] = 0.000000e+00f;
-  conv2d_nchw[10] = 0.000000e+00f;
-  conv2d_nchw[11] = 0.000000e+00f;
-  conv2d_nchw[12] = 0.000000e+00f;
-  conv2d_nchw[13] = 0.000000e+00f;
-  for (int rc_outer_outer = 0; rc_outer_outer &lt; 64; ++rc_outer_outer) {
-    for (int ry_outer_outer = 0; ry_outer_outer &lt; 3; ++ry_outer_outer) {
+  for (int rc_outer_outer = 0; rc_outer_outer &lt; 32; ++rc_outer_outer) {
+    for (int rx_outer_outer = 0; rx_outer_outer &lt; 3; ++rx_outer_outer) {
       __syncthreads();
-      pad_temp_shared[((int)threadIdx.x)] = (((((1 &lt;= (((((int)threadIdx.x) % 63) / 9) + ry_outer_outer)) &amp;&amp; ((((((int)threadIdx.x) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= (((int)threadIdx.x) % 9))) &amp;&amp; ((((int)threadIdx.x) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 392) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-      pad_temp_shared[(((int)threadIdx.x) + 112)] = (((((1 &lt;= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 4) % 9))) &amp;&amp; (((((int)threadIdx.x) + 4) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 112) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-      pad_temp_shared[(((int)threadIdx.x) + 224)] = (((((1 &lt;= ((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 35) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 8) % 9))) &amp;&amp; (((((int)threadIdx.x) + 8) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 224) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-      pad_temp_shared[(((int)threadIdx.x) + 336)] = (((((1 &lt;= ((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 21) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 3) % 9))) &amp;&amp; (((((int)threadIdx.x) + 3) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 336) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) * 2)] = (((((7 &lt;= ((((int)threadIdx.x) * 2) % 63)) &amp;&amp; (((((int)threadIdx.x) * 2) % 63) &lt; 56)) &amp;&amp; (1 &lt;= (rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)))) &amp;&amp; ((rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) * 2) / 63) * 49)) + rx_outer_outer) + ((((int)threadIdx.x) * 2) % 63)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[((((int)threadIdx.x) * 2) + 1)] = (((((7 &lt;= (((((int)threadIdx.x) * 2) + 1) % 63)) &amp;&amp; ((((((int)threadIdx.x) * 2) + 1) % 63) &lt; 56)) &amp;&amp; (1 &lt;= (rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)))) &amp;&amp; ((rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)) &lt; 8)) ? data[(((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) + 1) / 63) * 49)) + rx_outer_outer) + (((((int)threadIdx.x) * 2) + 1) % 63)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[((((((((int)threadIdx.x) * 2) + 448) / 63) * 63) + (((((((int)threadIdx.x) * 2) / 7) + 1) % 9) * 7)) + ((((int)threadIdx.x) * 2) % 7))] = (((((1 &lt;= ((((((int)threadIdx.x) * 2) / 7) + 1) % 9)) &amp;&amp; (((((((int)threadIdx.x) * 2) / 7) + 1) % 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)))) &amp;&amp; ((rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)) &lt; 8)) ? data[((((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) + 4 [...]
+      pad_temp_shared[((((((((int)threadIdx.x) * 2) + 449) / 63) * 63) + ((((((((int)threadIdx.x) * 2) + 1) / 7) + 1) % 9) * 7)) + (((((int)threadIdx.x) * 2) + 1) % 7))] = (((((1 &lt;= (((((((int)threadIdx.x) * 2) + 1) / 7) + 1) % 9)) &amp;&amp; ((((((((int)threadIdx.x) * 2) + 1) / 7) + 1) % 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)))) &amp;&amp; ((rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)) &lt; 8)) ? data[((((((rc_outer_outer * 78 [...]
       if (((int)threadIdx.x) &lt; 56) {
-        pad_temp_shared[(((int)threadIdx.x) + 448)] = (((((1 &lt;= ((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 7) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 7) % 9))) &amp;&amp; (((((int)threadIdx.x) + 7) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 392) + (((((int)threadIdx.x) + 448) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+        pad_temp_shared[((((((((int)threadIdx.x) * 2) + 896) / 63) * 63) + (((((((int)threadIdx.x) * 2) / 7) + 2) % 9) * 7)) + ((((int)threadIdx.x) * 2) % 7))] = (((((1 &lt;= ((((((int)threadIdx.x) * 2) / 7) + 2) % 9)) &amp;&amp; (((((((int)threadIdx.x) * 2) / 7) + 2) % 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)))) &amp;&amp; ((rx_outer_outer + ((((int)threadIdx.x) * 2) % 7)) &lt; 8)) ? data[((((((rc_outer_outer * 784) + ((((((int)threadIdx.x) * 2) + [...]
       }
-      kernel_shared[((int)threadIdx.x)] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
-      kernel_shared[(((int)threadIdx.x) + 112)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 112) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-      kernel_shared[(((int)threadIdx.x) + 224)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 224) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-      kernel_shared[(((int)threadIdx.x) + 336)] = kernel[(((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 64512)];
-      kernel_shared[(((int)threadIdx.x) + 448)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 448) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 16) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-      kernel_shared[(((int)threadIdx.x) + 560)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 560) / 24) * 4608)) + (rc_outer_outer * 72)) + ((((((int)threadIdx.x) + 8) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-      if (((int)threadIdx.x) &lt; 96) {
-        kernel_shared[(((int)threadIdx.x) + 672)] = kernel[(((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 24) * 4608)) + (rc_outer_outer * 72)) + (((((int)threadIdx.x) % 24) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 129024)];
+      if (((int)threadIdx.x) &lt; 56) {
+        pad_temp_shared[((((((((int)threadIdx.x) * 2) + 897) / 63) * 63) + ((((((((int)threadIdx.x) * 2) + 1) / 7) + 2) % 9) * 7)) + (((((int)threadIdx.x) * 2) + 1) % 7))] = (((((1 &lt;= (((((((int)threadIdx.x) * 2) + 1) / 7) + 2) % 9)) &amp;&amp; ((((((((int)threadIdx.x) * 2) + 1) / 7) + 2) % 9) &lt; 8)) &amp;&amp; (1 &lt;= (rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)))) &amp;&amp; ((rx_outer_outer + (((((int)threadIdx.x) * 2) + 1) % 7)) &lt; 8)) ? data[((((((rc_outer_outer *  [...]
+      }
+      kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer)];
+      kernel_shared[(((int)threadIdx.x) + 224)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 224) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 32) % 48) * 3)) + rx_outer_outer)];
+      kernel_shared[(((int)threadIdx.x) + 448)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 448) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) % 48) * 3)) + rx_outer_outer)];
+      kernel_shared[(((int)threadIdx.x) + 672)] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer) + 64512)];
+      kernel_shared[(((int)threadIdx.x) + 896)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 896) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 32) % 48) * 3)) + rx_outer_outer)];
+      kernel_shared[(((int)threadIdx.x) + 1120)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 1120) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) % 48) * 3)) + rx_outer_outer)];
+      if (((int)threadIdx.x) &lt; 192) {
+        kernel_shared[(((int)threadIdx.x) + 1344)] = kernel[((((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer) + 129024)];
       }
       __syncthreads();
-      for (int rc_outer_inner = 0; rc_outer_inner &lt; 8; ++rc_outer_inner) {
-        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9))] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[(((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3))]));
-        conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 24)]));
-        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 1)]));
-        conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 25)]));
-        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 8)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 2)]));
-        conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-        conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 3)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-        conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 4)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-        conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 5)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-        conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 6)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-        conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
-        conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + ((((int)threadIdx.x) % 7) * 9)) + 8)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + 26)]));
+      for (int rc_outer_inner = 0; rc_outer_inner &lt; 16; ++rc_outer_inner) {
+        for (int ry_outer_inner = 0; ry_outer_inner &lt; 3; ++ry_outer_inner) {
+          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 7)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 14)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 21)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 35)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_outer_inner * 63) + (ry_outer_inner * 7)) + (((int)threadIdx.x) % 7)) + 42)] * kernel_shared[((((((int)threadIdx.x) / 7) * 48) + (rc_outer_inner * 3)) + ry_outer_inner)]));
+        }
       }
     }
   }
-  for (int i1_inner = 0; i1_inner &lt; 2; ++i1_inner) {
-    for (int i3_inner = 0; i3_inner &lt; 7; ++i3_inner) {
-      compute[(((((((int)blockIdx.x) * 1568) + ((((int)threadIdx.x) / 7) * 98)) + (i1_inner * 49)) + ((((int)threadIdx.x) % 7) * 7)) + i3_inner)] = max((conv2d_nchw[((i1_inner * 7) + i3_inner)] + bias[(((((int)blockIdx.x) * 32) + ((((int)threadIdx.x) / 7) * 2)) + i1_inner)]), 0.000000e+00f);
-    }
+  for (int i2_inner = 0; i2_inner &lt; 7; ++i2_inner) {
+    compute[((((((int)blockIdx.x) * 1568) + ((((int)threadIdx.x) / 7) * 49)) + (i2_inner * 7)) + (((int)threadIdx.x) % 7))] = max((conv2d_nchw[i2_inner] + bias[((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 7))]), 0.000000e+00f);
   }
 }
 </pre></div>
@@ -840,7 +761,7 @@ In the example below we resume the status and do 5 more trials.</p>
 Get devices for measurement successfully!
 </pre></div>
 </div>
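Resuming works by reloading the measured states into a fresh search policy, so the 5 extra trials build on the earlier search instead of starting over; a minimal sketch, assuming task and log_file from the interrupted run:

    from tvm import auto_scheduler

    # Rebuild the cost model from existing records and preload measured states.
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)
    search_policy = auto_scheduler.SketchPolicy(
        task, cost_model,
        init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
    )
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5,  # the 5 extra trials
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    task.tune(tune_option, search_policy=search_policy)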
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  19.285 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  20.981 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_conv2d_layer_cuda.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
index ede3fd8be..878d15b32 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
@@ -876,7 +876,7 @@ so we can read the log file and load the best schedules.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-   9.8262       9.8166       9.8458       9.8162       0.0139
+   9.6649       9.6723       9.6914       9.6310       0.0252
 </pre></div>
 </div>
 </div>
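The summary above is produced by compiling the network with the tuned records applied and then timing the graph executor; a minimal sketch, assuming mod, params, target, dev, and log_file are defined as earlier in the tutorial:

    import numpy as np
    import tvm
    from tvm import auto_scheduler, relay
    from tvm.contrib import graph_executor

    # Compile the network with the best schedules from the log applied.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=target, params=params)

    module = graph_executor.GraphModule(lib["default"](dev))
    ftimer = module.module.time_evaluator("run", dev, repeat=3, min_repeat_ms=500)
    prof_res = np.array(ftimer().results) * 1e3  # seconds -> milliseconds
    print("Mean inference time (std dev): %.2f ms (%.2f ms)"
          % (np.mean(prof_res), np.std(prof_res)))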
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
index acd6886d7..24bcc50b7 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
@@ -895,7 +895,7 @@ so we can read the log file and load the best schedules.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  779.4330     778.2617     785.7413     774.2961      4.7454
+  760.7450     763.7720     763.9236     754.5393      4.3885
 </pre></div>
 </div>
 </div>
@@ -917,7 +917,7 @@ to learn how to use the RPC Tracker and RPC Server.
 To use the RPC Tracker in auto-scheduler, replace the runner in <code class="code docutils literal notranslate"><span class="pre">TuningOptions</span></code>
 with <a class="reference internal" href="../../reference/api/python/auto_scheduler.html#tvm.auto_scheduler.RPCRunner" title="tvm.auto_scheduler.RPCRunner"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.RPCRunner</span></code></a>.</p></li>
 </ol>
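A minimal sketch of the runner swap described above, assuming a tracker listening on 127.0.0.1:9190 and a hypothetical device key:

    from tvm import auto_scheduler

    # Measure on a remote device registered with the RPC tracker.
    runner = auto_scheduler.RPCRunner(
        key="rasp4b-64",  # hypothetical device key registered with the tracker
        host="127.0.0.1",
        port=9190,
        repeat=3,
        timeout=50,
    )
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,
        runner=runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],  # log_file assumed
    )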
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  20.977 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  19.068 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-network-x86-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_network_x86.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
index 4717a85f9..a8df6a18e 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
@@ -600,73 +600,29 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
              placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
              compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
   buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute} {
-  for (i0.outer.i1.outer.fused: int32, 0, 256) &quot;parallel&quot; {
-    allocate(compute_3: Pointer(global float32), float32, [256]), storage_scope = global {
-      for (i.inner.init: int32, 0, 16) {
-        let cse_var_1: int32 = (i.inner.init*16)
-         {
-          compute_4: Buffer(compute_3, float32, [256], [])[cse_var_1] = 0f32
-          compute_4[(cse_var_1 + 1)] = 0f32
-          compute_4[(cse_var_1 + 2)] = 0f32
-          compute_4[(cse_var_1 + 3)] = 0f32
-          compute_4[(cse_var_1 + 4)] = 0f32
-          compute_4[(cse_var_1 + 5)] = 0f32
-          compute_4[(cse_var_1 + 6)] = 0f32
-          compute_4[(cse_var_1 + 7)] = 0f32
-          compute_4[(cse_var_1 + 8)] = 0f32
-          compute_4[(cse_var_1 + 9)] = 0f32
-          compute_4[(cse_var_1 + 10)] = 0f32
-          compute_4[(cse_var_1 + 11)] = 0f32
-          compute_4[(cse_var_1 + 12)] = 0f32
-          compute_4[(cse_var_1 + 13)] = 0f32
-          compute_4[(cse_var_1 + 14)] = 0f32
-          compute_4[(cse_var_1 + 15)] = 0f32
-        }
-      }
-      for (elem_idx: int32, 0, let cse_var_2: int32 = floormod(i0.outer.i1.outer.fused, 32) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
-        for (i.inner: int32, 0, 16) {
-          let cse_var_21: int32 = floormod(i0.outer.i1.outer.fused, 32)
-          let cse_var_20: int32 = (i.inner*16)
-          let cse_var_19: int32 = (elem_idx*16)
-          let cse_var_18: int32 = (cse_var_20 + 10)
-          let cse_var_17: int32 = (cse_var_20 + 11)
-          let cse_var_16: int32 = (cse_var_20 + 12)
-          let cse_var_15: int32 = (cse_var_20 + 13)
-          let cse_var_14: int32 = (cse_var_20 + 14)
-          let cse_var_13: int32 = (cse_var_20 + 15)
-          let cse_var_12: int32 = (cse_var_20 + 2)
-          let cse_var_11: int32 = (cse_var_20 + 3)
-          let cse_var_10: int32 = (cse_var_20 + 4)
-          let cse_var_9: int32 = (cse_var_20 + 5)
-          let cse_var_8: int32 = (cse_var_20 + 6)
-          let cse_var_7: int32 = (cse_var_20 + 7)
-          let cse_var_6: int32 = (cse_var_20 + 8)
-          let cse_var_5: int32 = (cse_var_20 + 9)
-          let cse_var_4: int32 = (cse_var_20 + 1)
-          let cse_var_3: int32 = ((floordiv(i0.outer.i1.outer.fused, 32)*4096) + (i.inner*256))
-           {
-            compute_4[cse_var_20] = (compute_4[cse_var_20] + (placeholder_1[((placeholder_3[cse_var_21]*16) + cse_var_19)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_4] = (compute_4[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 1)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_12] = (compute_4[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 2)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_11] = (compute_4[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 3)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_10] = (compute_4[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 4)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_9] = (compute_4[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 5)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_8] = (compute_4[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 6)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_7] = (compute_4[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 7)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_6] = (compute_4[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 8)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_5] = (compute_4[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 9)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_18] = (compute_4[cse_var_18] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 10)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_17] = (compute_4[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 11)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_16] = (compute_4[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 12)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_15] = (compute_4[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 13)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_14] = (compute_4[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 14)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
-            compute_4[cse_var_13] = (compute_4[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_21]*16) + cse_var_19) + 15)]*max(placeholder[(cse_var_3 + placeholder_2[(placeholder_3[cse_var_21] + elem_idx)])], 0f32)))
+  for (i0.outer.i1.outer.fused: int32, 0, 16) &quot;parallel&quot; {
+    allocate(compute_3: Pointer(global float32), float32, [4096]), storage_scope = global {
+      for (i.outer.inner: int32, 0, 8) {
+        for (nb_j.inner: int32, 0, 2) {
+          for (i.inner.init: int32, 0, 16) {
+            for (j.init: int32, 0, 16) {
+              compute_4: Buffer(compute_3, float32, [4096], [])[((((i.outer.inner*512) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
+            }
+          }
+          for (elem_idx: int32, 0, let cse_var_1: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
+            for (i.inner: int32, 0, 16) {
+              for (j: int32, 0, 16) {
+                let cse_var_3: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
+                let cse_var_2: int32 = ((((i.outer.inner*512) + (i.inner*32)) + (nb_j.inner*16)) + j)
+                compute_4[cse_var_2] = (compute_4[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[(((i.outer.inner*4096) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+              }
+            }
           }
         }
       }
-      for (i0.inner: int32, 0, 16) {
-        let cse_var_22: int32 = (((floordiv(i0.outer.i1.outer.fused, 32)*8192) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 32)*16))
-        compute[ramp(cse_var_22, 1, 16)] = max((compute_4[ramp((i0.inner*16), 1, 16)] + placeholder_4[ramp(cse_var_22, 1, 16)]), broadcast(0f32, 16))
+      for (i0.inner: int32, 0, 128) {
+        let cse_var_4: int32 = ((i0.inner*512) + (i0.outer.i1.outer.fused*32))
+        compute[ramp(cse_var_4, 1, 32)] = max((compute_4[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_4, 1, 32)]), broadcast(0f32, 32))
       }
     }
   }
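
For readers who find the TIR above hard to follow: it is a block-sparse (BSR) matrix multiply with a fused bias-add and ReLU epilogue. Below is a minimal NumPy sketch of the same computation; the names x, data, indices, indptr, bias and the block width of 16 are assumptions chosen to mirror placeholder, placeholder_1, placeholder_2, placeholder_3 and placeholder_4, not the tutorial's actual buffer names.

    import numpy as np

    def bsr_dense_relu(x, data, indices, indptr, bias, bs=16):
        # x: (M, K) dense input; data: (nnz_blocks, bs) non-zero blocks of a
        # (N, K) weight stored in BSR form with block shape (bs, 1);
        # indices/indptr: the usual BSR column indices and row pointers.
        m = x.shape[0]
        n = (indptr.shape[0] - 1) * bs
        y = np.zeros((m, n), dtype=x.dtype)
        for bj in range(indptr.shape[0] - 1):        # one block row = bs output columns
            for e in range(indptr[bj], indptr[bj + 1]):
                col = indices[e]                     # contributing input feature
                # mirrors the max(placeholder[...], 0f32) in the inner loop above
                y[:, bj * bs:(bj + 1) * bs] += np.outer(
                    np.maximum(x[:, col], 0.0), data[e])
        return np.maximum(y + bias, 0.0)             # the vectorized epilogue above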
@@ -705,7 +661,7 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.919 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.497 ms
 </pre></div>
 </div>
 <div class="admonition note">
diff --git a/docs/how_to/tune_with_autotvm/sg_execution_times.html b/docs/how_to/tune_with_autotvm/sg_execution_times.html
index 0cd6b9744..96079eccb 100644
--- a/docs/how_to/tune_with_autotvm/sg_execution_times.html
+++ b/docs/how_to/tune_with_autotvm/sg_execution_times.html
@@ -300,13 +300,13 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-tune-with-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:43.833</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
+<p><strong>00:43.875</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:42.987</strong>: <a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></li>
-<li><p><strong>00:00.221</strong>: <a class="reference internal" href="tune_relay_x86.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a Convolutional Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></li>
-<li><p><strong>00:00.210</strong>: <a class="reference internal" href="tune_relay_arm.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-arm-py"><span class="std std-ref">Auto-tuning a Convolutional Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_arm.py</span></code>)</p></li>
-<li><p><strong>00:00.207</strong>: <a class="reference internal" href="tune_relay_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a Convolutional Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></li>
-<li><p><strong>00:00.207</strong>: <a class="reference internal" href="tune_relay_mobile_gpu.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-mobile-gpu-py"><span class="std std-ref">Auto-tuning a Convolutional Network for Mobile GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_mobile_gpu.py</span></code>)</p></li>
+<li><p><strong>00:43.044</strong>: <a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></li>
+<li><p><strong>00:00.220</strong>: <a class="reference internal" href="tune_relay_x86.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a Convolutional Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></li>
+<li><p><strong>00:00.205</strong>: <a class="reference internal" href="tune_relay_mobile_gpu.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-mobile-gpu-py"><span class="std std-ref">Auto-tuning a Convolutional Network for Mobile GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_mobile_gpu.py</span></code>)</p></li>
+<li><p><strong>00:00.204</strong>: <a class="reference internal" href="tune_relay_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a Convolutional Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></li>
+<li><p><strong>00:00.202</strong>: <a class="reference internal" href="tune_relay_arm.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-arm-py"><span class="std std-ref">Auto-tuning a Convolutional Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_arm.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
index 32dfc93b3..c30a3c6bc 100644
--- a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
+++ b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
@@ -1142,8 +1142,8 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 4, 4, 32]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 1, 128]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2885496
-No: 6   GFLOPS: 63.27/63.27     result: MeasureResult(costs=(0.0036591528999999996,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.580784797668457, timestamp=1650045397.8798823)       [(&#39;tile_f&#39;, [-1, 1, 1, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 4, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,3754080
-No: 7   GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 6   GFLOPS: 42.32/42.32     result: MeasureResult(costs=(0.005470547157894737,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.5707504749298096, timestamp=1650045818.0508544)       [(&#39;tile_f&#39;, [-1, 1, 1, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 4, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,3754080
+No: 7   GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -1266,7 +1266,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 1, 16, 32]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 256, 1]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6225319
-No: 8   GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 8   GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -1389,7 +1389,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 1, 32]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 8, 64]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,943546
-No: 9   GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 9   GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -1512,7 +1512,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 4, 16, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 16, 32]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2868708
-No: 10  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 10  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 142, in build
     res = future.result()
   File &quot;/usr/lib/python3.7/concurrent/futures/_base.py&quot;, line 435, in result
@@ -1530,7 +1530,7 @@ No: 10  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
 TimeoutError
 
         [(&#39;tile_f&#39;, [-1, 32, 2, 4]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 4, 2]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4691833
-No: 11  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 11  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -1653,7 +1653,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 1, 2, 64]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 4]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,1042124
-No: 12  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 12  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -1776,7 +1776,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 32, 1, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 32, 16]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,10013405
-No: 13  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 13  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -1899,7 +1899,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 8, 8, 2]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 4, 32]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6732082
-No: 14  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 14  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -2022,7 +2022,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 4, 32]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 128]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 1)],None,7536735
-No: 15  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 15  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -2145,7 +2145,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 1, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 128, 4]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,482121
-No: 16  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 16  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -2268,7 +2268,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 1, 16]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 32, 8]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2824525
-No: 17  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 17  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -2391,7 +2391,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 64, 1, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 8, 8]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4559286
-No: 18  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 18  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 571, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 523, in _build_func_common
@@ -2514,7 +2514,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 854, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 1, 32, 16]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 512]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9677544
-No: 19  GFLOPS: 0.00/63.27      result: Traceback (most recent call last):
+No: 19  GFLOPS: 0.00/42.32      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 721, in __call__
     yield remote, remote.load_module(os.path.split(build_result.filename)[1])
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 685, in run_through_rpc
@@ -2602,7 +2602,7 @@ tvm._ffi.base.TVMError: Traceback (most recent call last):
   15: _PyEval_EvalFrameDefault
   14: 0x0000000000537c30
   13: _PyObject_FastCallKeywords
-  12: 0x00007f8505170fa2
+  12: 0x00007f0a9c06afa2
   11: _ctypes_callproc
   10: ffi_call
   9: ffi_call_unix64
@@ -2667,7 +2667,7 @@ Traceback (most recent call last):
   21: _PyFunction_FastCallKeywords
   20: _PyEval_EvalFrameDefault
   19: _PyFunction_FastCall      [(&#39;tile_f&#39;, [-1, 8, 2, 16]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6390073
-No: 20  GFLOPS: 145.17/145.17   result: MeasureResult(costs=(0.00159474064,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4268317222595215, timestamp=1650045424.2071035)      [(&#39;tile_f&#39;, [-1, 1, 4, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9881539
+No: 20  GFLOPS: 144.59/144.59   result: MeasureResult(costs=(0.0016010948300000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4313409328460693, timestamp=1650045844.3790898)      [(&#39;tile_f&#39;, [-1, 1, 4, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9881539
 </pre></div>
 </div>
 <p>Finally we can inspect the best config from log file, check correctness,
@@ -2706,7 +2706,7 @@ and measure running time.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Best config:
 [(&#39;tile_f&#39;, [-1, 1, 4, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9881539
-Time cost of this operator: 0.001982
+Time cost of this operator: 0.001960
 </pre></div>
 </div>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autotvm-tune-conv2d-cuda-py">
diff --git a/docs/how_to/work_with_microtvm/micro_autotune.html b/docs/how_to/work_with_microtvm/micro_autotune.html
index f8616a02c..2aea59c3b 100644
--- a/docs/how_to/work_with_microtvm/micro_autotune.html
+++ b/docs/how_to/work_with_microtvm/micro_autotune.html
@@ -553,10 +553,10 @@ the tuned operator.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>########## Build without Autotuning ##########
 Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs
 ---------                                     ---                                           --------  -------  -----              ------  -------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  313.6     98.771   (1, 2, 10, 10, 3)  2       1
-tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.0       0.945    (1, 6, 10, 10)     1       1
-tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.901     0.284    (1, 1, 10, 10, 3)  1       1
-Total_time                                    -                                             317.501   -        -                  -       -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  334.0     98.794   (1, 2, 10, 10, 3)  2       1
+tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.131     0.926    (1, 6, 10, 10)     1       1
+tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.944     0.279    (1, 1, 10, 10, 3)  1       1
+Total_time                                    -                                             338.076   -        -                  -       -
 </pre></div>
 </div>
 </div>
@@ -608,10 +608,10 @@ Total_time                                    -
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>########## Build with Autotuning ##########
 Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs
 ---------                                     ---                                           --------  -------  -----              ------  -------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  81.2      96.83    (1, 6, 10, 10, 1)  2       1
-tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.738     2.072    (1, 6, 10, 10)     1       1
-tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.92      1.097    (1, 1, 10, 10, 3)  1       1
-Total_time                                    -                                             83.858    -        -                  -       -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  81.3      96.795   (1, 6, 10, 10, 1)  2       1
+tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.768     2.105    (1, 6, 10, 10)     1       1
+tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.924     1.1      (1, 1, 10, 10, 3)  1       1
+Total_time                                    -                                             83.992    -        -                  -       -
 </pre></div>
 </div>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-autotune-py">
diff --git a/docs/how_to/work_with_microtvm/sg_execution_times.html b/docs/how_to/work_with_microtvm/sg_execution_times.html
index b5f983b2c..bf8532fbb 100644
--- a/docs/how_to/work_with_microtvm/sg_execution_times.html
+++ b/docs/how_to/work_with_microtvm/sg_execution_times.html
@@ -300,12 +300,12 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-microtvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:44.076</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
+<p><strong>00:43.541</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:40.085</strong>: <a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_autotune.py</span></code>)</p></li>
-<li><p><strong>00:03.422</strong>: <a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></li>
-<li><p><strong>00:00.194</strong>: <a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></li>
-<li><p><strong>00:00.194</strong>: <a class="reference internal" href="micro_tvmc.html#sphx-glr-how-to-work-with-microtvm-micro-tvmc-py"><span class="std std-ref">Executing a Tiny Model with TVMC Micro</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tvmc.py</span></code>)</p></li>
+<li><p><strong>00:39.586</strong>: <a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_autotune.py</span></code>)</p></li>
+<li><p><strong>00:03.405</strong>: <a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></li>
+<li><p><strong>00:00.187</strong>: <a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></li>
+<li><p><strong>00:00.182</strong>: <a class="reference internal" href="micro_tvmc.html#sphx-glr-how-to-work-with-microtvm-micro-tvmc-py"><span class="std std-ref">Executing a Tiny Model with TVMC Micro</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tvmc.py</span></code>)</p></li>
 <li><p><strong>00:00.181</strong>: <a class="reference internal" href="micro_reference_vm.html#sphx-glr-how-to-work-with-microtvm-micro-reference-vm-py"><span class="std std-ref">microTVM Reference Virtual Machines</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_reference_vm.py</span></code>)</p></li>
 </ul>
 </div>
diff --git a/docs/how_to/work_with_relay/sg_execution_times.html b/docs/how_to/work_with_relay/sg_execution_times.html
index 18c5d42c8..3551dd98c 100644
--- a/docs/how_to/work_with_relay/sg_execution_times.html
+++ b/docs/how_to/work_with_relay/sg_execution_times.html
@@ -300,11 +300,11 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-relay-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:08.594</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
+<p><strong>00:08.670</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:06.823</strong>: <a class="reference internal" href="using_external_lib.html#sphx-glr-how-to-work-with-relay-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></li>
-<li><p><strong>00:01.563</strong>: <a class="reference internal" href="build_gcn.html#sphx-glr-how-to-work-with-relay-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></li>
-<li><p><strong>00:00.208</strong>: <a class="reference internal" href="using_relay_viz.html#sphx-glr-how-to-work-with-relay-using-relay-viz-py"><span class="std std-ref">Use Relay Visualizer to Visualize Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_relay_viz.py</span></code>)</p></li>
+<li><p><strong>00:06.816</strong>: <a class="reference internal" href="using_external_lib.html#sphx-glr-how-to-work-with-relay-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></li>
+<li><p><strong>00:01.650</strong>: <a class="reference internal" href="build_gcn.html#sphx-glr-how-to-work-with-relay-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></li>
+<li><p><strong>00:00.204</strong>: <a class="reference internal" href="using_relay_viz.html#sphx-glr-how-to-work-with-relay-using-relay-viz-py"><span class="std std-ref">Use Relay Visualizer to Visualize Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_relay_viz.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/work_with_schedules/sg_execution_times.html b/docs/how_to/work_with_schedules/sg_execution_times.html
index eee384103..238a91038 100644
--- a/docs/how_to/work_with_schedules/sg_execution_times.html
+++ b/docs/how_to/work_with_schedules/sg_execution_times.html
@@ -300,16 +300,16 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-schedules-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:05.519</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
+<p><strong>00:05.359</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:02.027</strong>: <a class="reference internal" href="intrin_math.html#sphx-glr-how-to-work-with-schedules-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></li>
-<li><p><strong>00:01.104</strong>: <a class="reference internal" href="tensorize.html#sphx-glr-how-to-work-with-schedules-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></li>
-<li><p><strong>00:00.711</strong>: <a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></li>
-<li><p><strong>00:00.693</strong>: <a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></li>
-<li><p><strong>00:00.307</strong>: <a class="reference internal" href="extern_op.html#sphx-glr-how-to-work-with-schedules-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></li>
-<li><p><strong>00:00.233</strong>: <a class="reference internal" href="tedd.html#sphx-glr-how-to-work-with-schedules-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></li>
-<li><p><strong>00:00.227</strong>: <a class="reference internal" href="schedule_primitives.html#sphx-glr-how-to-work-with-schedules-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></li>
-<li><p><strong>00:00.217</strong>: <a class="reference internal" href="tuple_inputs.html#sphx-glr-how-to-work-with-schedules-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></li>
+<li><p><strong>00:02.004</strong>: <a class="reference internal" href="intrin_math.html#sphx-glr-how-to-work-with-schedules-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></li>
+<li><p><strong>00:01.049</strong>: <a class="reference internal" href="tensorize.html#sphx-glr-how-to-work-with-schedules-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></li>
+<li><p><strong>00:00.694</strong>: <a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></li>
+<li><p><strong>00:00.675</strong>: <a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></li>
+<li><p><strong>00:00.292</strong>: <a class="reference internal" href="extern_op.html#sphx-glr-how-to-work-with-schedules-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></li>
+<li><p><strong>00:00.222</strong>: <a class="reference internal" href="schedule_primitives.html#sphx-glr-how-to-work-with-schedules-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></li>
+<li><p><strong>00:00.218</strong>: <a class="reference internal" href="tedd.html#sphx-glr-how-to-work-with-schedules-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></li>
+<li><p><strong>00:00.203</strong>: <a class="reference internal" href="tuple_inputs.html#sphx-glr-how-to-work-with-schedules-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/how_to/work_with_schedules/tensorize.html b/docs/how_to/work_with_schedules/tensorize.html
index db0f0137a..0228f754a 100644
--- a/docs/how_to/work_with_schedules/tensorize.html
+++ b/docs/how_to/work_with_schedules/tensorize.html
@@ -548,7 +548,7 @@ The importing needs to happen before the tensorized GEMV being executed.</p>
              B: Buffer(B_2: Pointer(float32), float32, [32768], []),
              C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
-  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmpm85nq5e4/input0.cc&#39;\nsource_filename = \&quot;/tmp/tmpm85nq5e4/input0.cc\&quot;\ntarget datalayout = \&quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128\&quot;\ntarget triple = \&quot;x86_64-pc-linux-gnu\&quot;\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = allo [...]
+  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmpra43z4sb/input0.cc&#39;\nsource_filename = \&quot;/tmp/tmpra43z4sb/input0.cc\&quot;\ntarget datalayout = \&quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128\&quot;\ntarget triple = \&quot;x86_64-pc-linux-gnu\&quot;\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = allo [...]
   for (i, 0, 1024) {
     for (j.outer: int32, 0, 32) {
       @tir.call_extern(&quot;gemv_update&quot;, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
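
Each @tir.call_extern above hands a 16-column tile of C to the imported gemv_update kernel. Reading the shapes off the buffer map (A is 1024x64, B is 512x64, C is 1024x512) and assuming gemv_update has the tutorial's usual semantics (cc[j] += aa[k] * bb[j*stride + k]), the loop nest amounts to the following NumPy sketch:

    import numpy as np

    A = np.random.rand(1024, 64).astype("float32")
    B = np.random.rand(512, 64).astype("float32")
    C = np.zeros((1024, 512), dtype="float32")
    for i in range(1024):
        for jo in range(32):
            # one gemv_update(cc, aa, bb, 16, 64, 64) call updates a 16-wide tile
            C[i, jo * 16:(jo + 1) * 16] += A[i, :] @ B[jo * 16:(jo + 1) * 16, :].T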
diff --git a/docs/reference/api/doxygen/iter__affine__map_8h.html b/docs/reference/api/doxygen/iter__affine__map_8h.html
index ec47172b1..1ef8a4b74 100644
--- a/docs/reference/api/doxygen/iter__affine__map_8h.html
+++ b/docs/reference/api/doxygen/iter__affine__map_8h.html
@@ -124,9 +124,9 @@ Namespaces</h2></td></tr>
 </table><table class="memberdecls">
 <tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="func-members"></a>
 Functions</h2></td></tr>
-<tr class="memitem:a60ff187f559dba2d570d6a96f2fced15"><td class="memItemLeft" align="right" valign="top">Array&lt; IterSumExpr &gt;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15">tvm::arith::DetectIterMap</a> (const Array&lt; PrimExpr &gt; &amp;indices, const Map&lt; Var, Range &gt; &amp;input_iters, const PrimExpr &amp;predicate, bool require_bijective, arith::Analyzer *analyzer)</td></tr>
-<tr class="memdesc:a60ff187f559dba2d570d6a96f2fced15"><td class="mdescLeft">&#160;</td><td class="mdescRight">Detect if indices can be written as [y_0 + c_0, y_1 + c_1, ..., y_n + c_n].  <a href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15">More...</a><br /></td></tr>
-<tr class="separator:a60ff187f559dba2d570d6a96f2fced15"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:ab1eb48d4326e3a530b3d77f391e83175"><td class="memItemLeft" align="right" valign="top">Array&lt; IterSumExpr &gt;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175">tvm::arith::DetectIterMap</a> (const Array&lt; PrimExpr &gt; &amp;indices, const Map&lt; Var, Range &gt; &amp;input_iters, const PrimExpr &amp;predicate, bool require_bijective, arith::Analyzer *analyzer, bool simplify_trivial_ [...]
+<tr class="memdesc:ab1eb48d4326e3a530b3d77f391e83175"><td class="mdescLeft">&#160;</td><td class="mdescRight">Detect if indices can be written as [y_0 + c_0, y_1 + c_1, ..., y_n + c_n].  <a href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175">More...</a><br /></td></tr>
+<tr class="separator:ab1eb48d4326e3a530b3d77f391e83175"><td class="memSeparator" colspan="2">&#160;</td></tr>
 <tr class="memitem:ab26374719c9dc2fe371f684ff8a33474"><td class="memItemLeft" align="right" valign="top">Array&lt; PrimExpr &gt;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#ab26374719c9dc2fe371f684ff8a33474">tvm::arith::IterMapSimplify</a> (const Array&lt; PrimExpr &gt; &amp;indices, const Map&lt; Var, Range &gt; &amp;input_iters, const PrimExpr &amp;input_pred, bool require_bijective)</td></tr>
 <tr class="memdesc:ab26374719c9dc2fe371f684ff8a33474"><td class="mdescLeft">&#160;</td><td class="mdescRight">Use IterVarMap detector to rewrite and simplify the indices.  <a href="namespacetvm_1_1arith.html#ab26374719c9dc2fe371f684ff8a33474">More...</a><br /></td></tr>
 <tr class="separator:ab26374719c9dc2fe371f684ff8a33474"><td class="memSeparator" colspan="2">&#160;</td></tr>
diff --git a/docs/reference/api/doxygen/iter__affine__map_8h_source.html b/docs/reference/api/doxygen/iter__affine__map_8h_source.html
index 3d1bcaf43..1b81873f0 100644
--- a/docs/reference/api/doxygen/iter__affine__map_8h_source.html
+++ b/docs/reference/api/doxygen/iter__affine__map_8h_source.html
@@ -66,7 +66,7 @@ $(function() {
 <div class="title">iter_affine_map.h</div>  </div>
 </div><!--header-->
 <div class="contents">
-<a href="iter__affine__map_8h.html">Go to the documentation of this file.</a><div class="fragment"><div class="line"><a name="l00001"></a><span class="lineno">    1</span>&#160;<span class="comment">/*</span></div><div class="line"><a name="l00002"></a><span class="lineno">    2</span>&#160;<span class="comment"> * Licensed to the Apache Software Foundation (ASF) under one</span></div><div class="line"><a name="l00003"></a><span class="lineno">    3</span>&#160;<span class="comment"> * o [...]
+<a href="iter__affine__map_8h.html">Go to the documentation of this file.</a><div class="fragment"><div class="line"><a name="l00001"></a><span class="lineno">    1</span>&#160;<span class="comment">/*</span></div><div class="line"><a name="l00002"></a><span class="lineno">    2</span>&#160;<span class="comment"> * Licensed to the Apache Software Foundation (ASF) under one</span></div><div class="line"><a name="l00003"></a><span class="lineno">    3</span>&#160;<span class="comment"> * o [...]
 <div class="ttc" id="classtvm_1_1arith_1_1IterSplitExpr_html"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSplitExpr.html">tvm::arith::IterSplitExpr</a></div><div class="ttdoc">Managed reference to IterSplitExprNode. </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:187</div></div>
 <div class="ttc" id="classtvm_1_1arith_1_1IterSumExprNode_html"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSumExprNode.html">tvm::arith::IterSumExprNode</a></div><div class="ttdoc">Fuse multiple iterators by summing them with scaling. </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:219</div></div>
 <div class="ttc" id="classtvm_1_1arith_1_1IterSumExprNode_html_a7545b5a6fa94181b7d9a36c9591ded47"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSumExprNode.html#a7545b5a6fa94181b7d9a36c9591ded47">tvm::arith::IterSumExprNode::SEqualReduce</a></div><div class="ttdeci">bool SEqualReduce(const IterSumExprNode *other, SEqualReducer equal) const</div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:232</div></div>
@@ -96,6 +96,7 @@ $(function() {
 <div class="ttc" id="classtvm_1_1arith_1_1IterSumExprNode_html_adf293d649d6c073dafcd0dad75bb4505"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSumExprNode.html#adf293d649d6c073dafcd0dad75bb4505">tvm::arith::IterSumExprNode::VisitAttrs</a></div><div class="ttdeci">void VisitAttrs(tvm::AttrVisitor *v)</div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:227</div></div>
 <div class="ttc" id="classtvm_1_1arith_1_1IterSplitExprNode_html_afe50d660cd72a521455f9cbfb5ac77ff"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSplitExprNode.html#afe50d660cd72a521455f9cbfb5ac77ff">tvm::arith::IterSplitExprNode::extent</a></div><div class="ttdeci">PrimExpr extent</div><div class="ttdoc">The extent of the split. </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:155</div></div>
 <div class="ttc" id="namespacetvm_1_1arith_html_af5e4b93476f56fc5d68dbba17859660e"><div class="ttname"><a href="namespacetvm_1_1arith.html#af5e4b93476f56fc5d68dbba17859660e">tvm::arith::NormalizeIterMapToExpr</a></div><div class="ttdeci">PrimExpr NormalizeIterMapToExpr(const IterMapExpr &amp;expr)</div><div class="ttdoc">Given an IterMapExpr, transform it to normal PrimExpr. </div></div>
+<div class="ttc" id="namespacetvm_1_1arith_html_ab1eb48d4326e3a530b3d77f391e83175"><div class="ttname"><a href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175">tvm::arith::DetectIterMap</a></div><div class="ttdeci">Array&lt; IterSumExpr &gt; DetectIterMap(const Array&lt; PrimExpr &gt; &amp;indices, const Map&lt; Var, Range &gt; &amp;input_iters, const PrimExpr &amp;predicate, bool require_bijective, arith::Analyzer *analyzer, bool simplify_trivial_iterators=true)</div><div  [...]
 <div class="ttc" id="classtvm_1_1arith_1_1IterSumExprNode_html_a924fe4b31a34f7b0cb489e87ba2c8d24"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSumExprNode.html#a924fe4b31a34f7b0cb489e87ba2c8d24">tvm::arith::IterSumExprNode::args</a></div><div class="ttdeci">Array&lt; IterSplitExpr &gt; args</div><div class="ttdoc">The args to the sum. </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:222</div></div>
 <div class="ttc" id="classtvm_1_1SHashReducer_1_1Handler_html_a8f9a489881fc55552f13a58313a863cf"><div class="ttname"><a href="classtvm_1_1SHashReducer_1_1Handler.html#a8f9a489881fc55552f13a58313a863cf">tvm::SHashReducer::Handler::MarkGraphNode</a></div><div class="ttdeci">virtual void MarkGraphNode()=0</div><div class="ttdoc">Mark current comparison as graph node in hashing. Graph node hash will depends on the graph structure...</div></div>
 <div class="ttc" id="classtvm_1_1arith_1_1IterMarkNode_html_a5afb11ef3b40b09b086214c156bb3d5c"><div class="ttname"><a href="classtvm_1_1arith_1_1IterMarkNode.html#a5afb11ef3b40b09b086214c156bb3d5c">tvm::arith::IterMarkNode::SEqualReduce</a></div><div class="ttdeci">bool SEqualReduce(const IterMarkNode *other, SEqualReducer equal) const</div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:109</div></div>
@@ -105,7 +106,6 @@ $(function() {
 <div class="ttc" id="classtvm_1_1arith_1_1IterSplitExprNode_html_a7a129dc9b432359a07c1a1e286c3c66f"><div class="ttname"><a href="classtvm_1_1arith_1_1IterSplitExprNode.html#a7a129dc9b432359a07c1a1e286c3c66f">tvm::arith::IterSplitExprNode::source</a></div><div class="ttdeci">IterMark source</div><div class="ttdoc">The source marked iterator. </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:151</div></div>
 <div class="ttc" id="object_8h_html_ac6e7295a4999e2c8e4a2c990beca887a"><div class="ttname"><a href="object_8h.html#ac6e7295a4999e2c8e4a2c990beca887a">TVM_DEFINE_OBJECT_REF_METHODS</a></div><div class="ttdeci">#define TVM_DEFINE_OBJECT_REF_METHODS(TypeName, ParentType, ObjectName)</div><div class="ttdef"><b>Definition:</b> object.h:713</div></div>
 <div class="ttc" id="classtvm_1_1arith_1_1IterMark_html"><div class="ttname"><a href="classtvm_1_1arith_1_1IterMark.html">tvm::arith::IterMark</a></div><div class="ttdoc">Managed reference to IterMarkExprNode. </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:130</div></div>
-<div class="ttc" id="namespacetvm_1_1arith_html_a60ff187f559dba2d570d6a96f2fced15"><div class="ttname"><a href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15">tvm::arith::DetectIterMap</a></div><div class="ttdeci">Array&lt; IterSumExpr &gt; DetectIterMap(const Array&lt; PrimExpr &gt; &amp;indices, const Map&lt; Var, Range &gt; &amp;input_iters, const PrimExpr &amp;predicate, bool require_bijective, arith::Analyzer *analyzer)</div><div class="ttdoc">Detect if indices can be [...]
 <div class="ttc" id="classtvm_1_1runtime_1_1ObjectRef_html"><div class="ttname"><a href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></div><div class="ttdoc">Base class of all object reference. </div><div class="ttdef"><b>Definition:</b> object.h:511</div></div>
 <div class="ttc" id="object_8h_html_af8330e3864503fb7c4133ae4d48fe4a2"><div class="ttname"><a href="object_8h.html#af8330e3864503fb7c4133ae4d48fe4a2">TVM_DEFINE_OBJECT_REF_COW_METHOD</a></div><div class="ttdeci">#define TVM_DEFINE_OBJECT_REF_COW_METHOD(ObjectName)</div><div class="ttdoc">Define CopyOnWrite function in an ObjectRef. </div><div class="ttdef"><b>Definition:</b> object.h:785</div></div>
 <div class="ttc" id="classtvm_1_1arith_1_1IterMarkNode_html"><div class="ttname"><a href="classtvm_1_1arith_1_1IterMarkNode.html">tvm::arith::IterMarkNode</a></div><div class="ttdoc">Mark the source as an iterator in [0, extent). </div><div class="ttdef"><b>Definition:</b> iter_affine_map.h:91</div></div>
diff --git a/docs/reference/api/doxygen/namespacemembers_d.html b/docs/reference/api/doxygen/namespacemembers_d.html
index 2520c91a1..18fdbac53 100644
--- a/docs/reference/api/doxygen/namespacemembers_d.html
+++ b/docs/reference/api/doxygen/namespacemembers_d.html
@@ -124,7 +124,7 @@ $(function() {
 : <a class="el" href="namespacetvm_1_1relay.html#a62b651084b386991221bc32c020cbef5">tvm::relay</a>
 </li>
 <li>DetectIterMap()
-: <a class="el" href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15">tvm::arith</a>
+: <a class="el" href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175">tvm::arith</a>
 </li>
 <li>DetectLinearEquation()
 : <a class="el" href="namespacetvm_1_1arith.html#a87a12ee0854469b04329a961ef261559">tvm::arith</a>
diff --git a/docs/reference/api/doxygen/namespacemembers_func_d.html b/docs/reference/api/doxygen/namespacemembers_func_d.html
index d753fc0d3..a911bcfa6 100644
--- a/docs/reference/api/doxygen/namespacemembers_func_d.html
+++ b/docs/reference/api/doxygen/namespacemembers_func_d.html
@@ -118,7 +118,7 @@ $(function() {
 : <a class="el" href="namespacetvm_1_1relay.html#a62b651084b386991221bc32c020cbef5">tvm::relay</a>
 </li>
 <li>DetectIterMap()
-: <a class="el" href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15">tvm::arith</a>
+: <a class="el" href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175">tvm::arith</a>
 </li>
 <li>DetectLinearEquation()
 : <a class="el" href="namespacetvm_1_1arith.html#a87a12ee0854469b04329a961ef261559">tvm::arith</a>
diff --git a/docs/reference/api/doxygen/namespacetvm_1_1arith.html b/docs/reference/api/doxygen/namespacetvm_1_1arith.html
index 97cb0656b..8476715a3 100644
--- a/docs/reference/api/doxygen/namespacetvm_1_1arith.html
+++ b/docs/reference/api/doxygen/namespacetvm_1_1arith.html
@@ -253,9 +253,9 @@ Functions</h2></td></tr>
 <tr class="memitem:ab667739c074bb7bf1e63302904c78176"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1arith_1_1IntConstraintsTransform.html">IntConstraintsTransform</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#ab667739c074bb7bf1e63302904c78176">SolveInequalitiesDeskewRange</a> (const <a class="el" href="classtvm_1_1arith_1_1IntConstraints.html">IntConstraints</a> &amp;system_to_solve)</td></tr>
 <tr class="memdesc:ab667739c074bb7bf1e63302904c78176"><td class="mdescLeft">&#160;</td><td class="mdescRight">Solve linear inequalities and deskew the ranges towards zero.  <a href="#ab667739c074bb7bf1e63302904c78176">More...</a><br /></td></tr>
 <tr class="separator:ab667739c074bb7bf1e63302904c78176"><td class="memSeparator" colspan="2">&#160;</td></tr>
-<tr class="memitem:a60ff187f559dba2d570d6a96f2fced15"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1runtime_1_1Array.html">Array</a>&lt; <a class="el" href="classtvm_1_1arith_1_1IterSumExpr.html">IterSumExpr</a> &gt;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15">DetectIterMap</a> (const <a class="el" href="classtvm_1_1runtime_1_1Array.html">Array</a>&lt; <a class="e [...]
-<tr class="memdesc:a60ff187f559dba2d570d6a96f2fced15"><td class="mdescLeft">&#160;</td><td class="mdescRight">Detect if indices can be written as [y_0 + c_0, y_1 + c_1, ..., y_n + c_n].  <a href="#a60ff187f559dba2d570d6a96f2fced15">More...</a><br /></td></tr>
-<tr class="separator:a60ff187f559dba2d570d6a96f2fced15"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:ab1eb48d4326e3a530b3d77f391e83175"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1runtime_1_1Array.html">Array</a>&lt; <a class="el" href="classtvm_1_1arith_1_1IterSumExpr.html">IterSumExpr</a> &gt;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175">DetectIterMap</a> (const <a class="el" href="classtvm_1_1runtime_1_1Array.html">Array</a>&lt; <a class="e [...]
+<tr class="memdesc:ab1eb48d4326e3a530b3d77f391e83175"><td class="mdescLeft">&#160;</td><td class="mdescRight">Detect if indices can be written as [y_0 + c_0, y_1 + c_1, ..., y_n + c_n].  <a href="#ab1eb48d4326e3a530b3d77f391e83175">More...</a><br /></td></tr>
+<tr class="separator:ab1eb48d4326e3a530b3d77f391e83175"><td class="memSeparator" colspan="2">&#160;</td></tr>
 <tr class="memitem:ab26374719c9dc2fe371f684ff8a33474"><td class="memItemLeft" align="right" valign="top"><a class="el" href="classtvm_1_1runtime_1_1Array.html">Array</a>&lt; <a class="el" href="classtvm_1_1PrimExpr.html">PrimExpr</a> &gt;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacetvm_1_1arith.html#ab26374719c9dc2fe371f684ff8a33474">IterMapSimplify</a> (const <a class="el" href="classtvm_1_1runtime_1_1Array.html">Array</a>&lt; <a class="el" href="clas [...]
 <tr class="memdesc:ab26374719c9dc2fe371f684ff8a33474"><td class="mdescLeft">&#160;</td><td class="mdescRight">Use IterVarMap detector to rewrite and simplify the indices.  <a href="#ab26374719c9dc2fe371f684ff8a33474">More...</a><br /></td></tr>
 <tr class="separator:ab26374719c9dc2fe371f684ff8a33474"><td class="memSeparator" colspan="2">&#160;</td></tr>
@@ -571,8 +571,8 @@ Variables</h2></td></tr>
 
 </div>
 </div>
-<a id="a60ff187f559dba2d570d6a96f2fced15"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a60ff187f559dba2d570d6a96f2fced15">&#9670;&nbsp;</a></span>DetectIterMap()</h2>
+<a id="ab1eb48d4326e3a530b3d77f391e83175"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#ab1eb48d4326e3a530b3d77f391e83175">&#9670;&nbsp;</a></span>DetectIterMap()</h2>
 
 <div class="memitem">
 <div class="memproto">
@@ -605,7 +605,13 @@ Variables</h2></td></tr>
           <td class="paramkey"></td>
           <td></td>
           <td class="paramtype"><a class="el" href="classtvm_1_1arith_1_1Analyzer.html">arith::Analyzer</a> *&#160;</td>
-          <td class="paramname"><em>analyzer</em>&#160;</td>
+          <td class="paramname"><em>analyzer</em>, </td>
+        </tr>
+        <tr>
+          <td class="paramkey"></td>
+          <td></td>
+          <td class="paramtype">bool&#160;</td>
+          <td class="paramname"><em>simplify_trivial_iterators</em> = <code>true</code>&#160;</td>
         </tr>
         <tr>
           <td></td>
@@ -627,7 +633,8 @@ Variables</h2></td></tr>
     <tr><td class="paramname">input_iters</td><td>Map from variable to iterator's range. </td></tr>
     <tr><td class="paramname">predicate</td><td>The predicate constraints on the input iterators </td></tr>
     <tr><td class="paramname">require_bijective</td><td>A boolean flag that indicates whether the mapping should be bijective. </td></tr>
-    <tr><td class="paramname">analyzer</td><td><a class="el" href="classtvm_1_1arith_1_1Analyzer.html" title="Analyzer that contains bunch of sub-analyzers. ">Analyzer</a> used to get context information.</td></tr>
+    <tr><td class="paramname">analyzer</td><td><a class="el" href="classtvm_1_1arith_1_1Analyzer.html" title="Analyzer that contains bunch of sub-analyzers. ">Analyzer</a> used to get context information. </td></tr>
+    <tr><td class="paramname">simplify_trivial_iterators</td><td>If true, iterators with extent of 1 will be replaced with a constant value.</td></tr>
   </table>
   </dd>
 </dl>
@@ -1099,7 +1106,7 @@ Variables</h2></td></tr>
 <p>Apply the inverse of the affine transformation to the outputs. </p>
 <p>Similar to back-propagation, starting from the outputs, it visits the DAG of the expressions in reverse topological order and applies the inverse of the affine transformation until it reaches the input. The affine iter map is required to be bijective.</p>
 <p>For example, iter_map = [l0 // 16, l0 % 16], outputs = [output_0, output_1], the affine transformation specified by <code>iter_map</code> will be applied to <code>outputs</code> and the result will be {l0: ((output_0*16) + output_1)}.</p>
-<dl class="section see"><dt>See also</dt><dd><a class="el" href="namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15" title="Detect if indices can be written as [y_0 + c_0, y_1 + c_1, ..., y_n + c_n]. ">DetectIterMap</a></dd></dl>
+<dl class="section see"><dt>See also</dt><dd><a class="el" href="namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175" title="Detect if indices can be written as [y_0 + c_0, y_1 + c_1, ..., y_n + c_n]. ">DetectIterMap</a></dd></dl>
 <dl class="params"><dt>Parameters</dt><dd>
   <table class="params">
     <tr><td class="paramname">iter_map</td><td>The bijective affine iter map. </td></tr>
diff --git a/docs/reference/api/doxygen/search/all_5.js b/docs/reference/api/doxygen/search/all_5.js
index 76ff5785f..af915a4a2 100644
--- a/docs/reference/api/doxygen/search/all_5.js
+++ b/docs/reference/api/doxygen/search/all_5.js
@@ -74,7 +74,7 @@ var searchData=
   ['detectbufferaccesslca',['DetectBufferAccessLCA',['../namespacetvm_1_1tir.html#abbd3ced524b506f532aa1d8ae36dadf3',1,'tvm::tir']]],
   ['detectclipbound',['DetectClipBound',['../namespacetvm_1_1arith.html#a739616342876c2633b87ed16c649bc91',1,'tvm::arith']]],
   ['detectfeature',['DetectFeature',['../namespacetvm_1_1relay.html#a62b651084b386991221bc32c020cbef5',1,'tvm::relay::DetectFeature(const RelayExpr &amp;expr)'],['../namespacetvm_1_1relay.html#a81978c82e1130854e575ccabc152ad70',1,'tvm::relay::DetectFeature(const IRModule &amp;mod)'],['../namespacetvm_1_1relay.html#a191d5425083368521d49cc49cef65aba',1,'tvm::relay::DetectFeature(const Expr &amp;expr, const IRModule &amp;mod)']]],
-  ['detectitermap',['DetectIterMap',['../namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15',1,'tvm::arith']]],
+  ['detectitermap',['DetectIterMap',['../namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175',1,'tvm::arith']]],
   ['detectlinearequation',['DetectLinearEquation',['../namespacetvm_1_1arith.html#a87a12ee0854469b04329a961ef261559',1,'tvm::arith']]],
   ['dev',['dev',['../structtvm_1_1runtime_1_1profiling_1_1CallFrame.html#abe09bc06a0a25001435ef512865d6259',1,'tvm::runtime::profiling::CallFrame']]],
   ['device',['device',['../classtvm_1_1auto__scheduler_1_1ProgramRunnerNode.html#acc0be09b0c6b0f21aef92088c0e38602',1,'tvm::auto_scheduler::ProgramRunnerNode::device()'],['../structtvm_1_1runtime_1_1profiling_1_1DeviceWrapperNode.html#a1c3c3c0fc8f177ddedc0ec02ca77b123',1,'tvm::runtime::profiling::DeviceWrapperNode::device()'],['../structtvm_1_1runtime_1_1vm_1_1Buffer.html#a2dc9562c031262e16ff6e8d007f601f2',1,'tvm::runtime::vm::Buffer::device()'],['../namespacetvm.html#a7c2095aed90b2129ba [...]
diff --git a/docs/reference/api/doxygen/search/functions_4.js b/docs/reference/api/doxygen/search/functions_4.js
index bfbe848b2..efd8cfa7a 100644
--- a/docs/reference/api/doxygen/search/functions_4.js
+++ b/docs/reference/api/doxygen/search/functions_4.js
@@ -34,7 +34,7 @@ var searchData=
   ['detectbufferaccesslca',['DetectBufferAccessLCA',['../namespacetvm_1_1tir.html#abbd3ced524b506f532aa1d8ae36dadf3',1,'tvm::tir']]],
   ['detectclipbound',['DetectClipBound',['../namespacetvm_1_1arith.html#a739616342876c2633b87ed16c649bc91',1,'tvm::arith']]],
   ['detectfeature',['DetectFeature',['../namespacetvm_1_1relay.html#a62b651084b386991221bc32c020cbef5',1,'tvm::relay::DetectFeature(const RelayExpr &amp;expr)'],['../namespacetvm_1_1relay.html#a81978c82e1130854e575ccabc152ad70',1,'tvm::relay::DetectFeature(const IRModule &amp;mod)'],['../namespacetvm_1_1relay.html#a191d5425083368521d49cc49cef65aba',1,'tvm::relay::DetectFeature(const Expr &amp;expr, const IRModule &amp;mod)']]],
-  ['detectitermap',['DetectIterMap',['../namespacetvm_1_1arith.html#a60ff187f559dba2d570d6a96f2fced15',1,'tvm::arith']]],
+  ['detectitermap',['DetectIterMap',['../namespacetvm_1_1arith.html#ab1eb48d4326e3a530b3d77f391e83175',1,'tvm::arith']]],
   ['detectlinearequation',['DetectLinearEquation',['../namespacetvm_1_1arith.html#a87a12ee0854469b04329a961ef261559',1,'tvm::arith']]],
   ['device_5ftype',['device_type',['../classtvm_1_1VirtualDeviceNode.html#a5e3f67045652bc27b937acf1ddc677a7',1,'tvm::VirtualDeviceNode']]],
   ['devicecopy',['DeviceCopy',['../structtvm_1_1runtime_1_1vm_1_1Instruction.html#ad38748aeb7650b185d8548e491aa9da6',1,'tvm::runtime::vm::Instruction']]],
diff --git a/docs/reference/api/python/auto_scheduler.html b/docs/reference/api/python/auto_scheduler.html
index ff8424ebc..95e09d5e4 100644
--- a/docs/reference/api/python/auto_scheduler.html
+++ b/docs/reference/api/python/auto_scheduler.html
@@ -1713,7 +1713,7 @@ Can be a function or the function name.</p></li>
 
 <dl class="py function">
 <dt class="sig sig-object py" id="tvm.auto_scheduler.auto_schedule">
-<span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">auto_schedule</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">search_policy</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em clas [...]
+<span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">auto_schedule</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">search_policy</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em clas [...]
 <dd><p>THIS API IS DEPRECATED.</p>
 <p>Run auto scheduling search for a task.</p>
 <dl class="field-list simple">
@@ -1750,7 +1750,7 @@ the initial naive schedule (state).</p>
 
 <dl class="py class">
 <dt class="sig sig-object py" id="tvm.auto_scheduler.SketchPolicy">
-<em class="property"><span class="pre">class</span> </em><span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">SketchPolicy</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">program_cost_model</span></span><span class="o"><span class="pre">=</span></span><span class="defau [...]
+<em class="property"><span class="pre">class</span> </em><span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">SketchPolicy</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">program_cost_model</span></span><span class="o"><span class="pre">=</span></span><span class="defau [...]
 <dd><p>The search policy that searches in a hierarchical search space defined by sketches.
 The policy randomly samples programs from the space defined by sketches and uses evolutionary
 search to fine-tune them.</p>
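
[Editor's note] The auto_scheduler.html hunks above only re-render the auto_schedule and SketchPolicy signature lines (their trailing default arguments are truncated in this plain-text view), but the surrounding docstrings state the API contract: auto_schedule is deprecated, and SketchPolicy samples programs from a sketch-defined space and fine-tunes them with evolutionary search. A minimal, hypothetical usage sketch under those docstrings follows; the matmul workload, trial count, and llvm target are illustrative assumptions, not content from this diff.

    import tvm
    from tvm import auto_scheduler, te

    # Hypothetical workload, registered only so a SearchTask can be built.
    @auto_scheduler.register_workload
    def matmul(n):
        A = te.placeholder((n, n), name="A")
        B = te.placeholder((n, n), name="B")
        k = te.reduce_axis((0, n), name="k")
        C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
        return [A, B, C]

    task = auto_scheduler.SearchTask(func=matmul, args=(128,), target="llvm")

    # SketchPolicy: hierarchical sketch space plus evolutionary fine-tuning,
    # as described in the class docstring above.
    policy = auto_scheduler.SketchPolicy(task, program_cost_model=auto_scheduler.XGBModel())

    # auto_schedule(...) is deprecated per the page above; current code drives
    # the search through SearchTask.tune instead (runs real local measurements).
    task.tune(auto_scheduler.TuningOptions(num_measure_trials=10), search_policy=policy)
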
diff --git a/docs/reference/api/typedoc/classes/bytestreamreader.html b/docs/reference/api/typedoc/classes/bytestreamreader.html
index 94991495b..61fd804b1 100644
--- a/docs/reference/api/typedoc/classes/bytestreamreader.html
+++ b/docs/reference/api/typedoc/classes/bytestreamreader.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -141,7 +141,7 @@
 					<div class="tsd-signature tsd-kind-icon">bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Uint8Array</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -151,7 +151,7 @@
 					<div class="tsd-signature tsd-kind-icon">offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -168,7 +168,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">Uint8Array</span></h4>
@@ -185,7 +185,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -202,7 +202,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/reference/api/typedoc/classes/cachedcallstack.html b/docs/reference/api/typedoc/classes/cachedcallstack.html
index 5eb0b5517..8b9651dea 100644
--- a/docs/reference/api/typedoc/classes/cachedcallstack.html
+++ b/docs/reference/api/typedoc/classes/cachedcallstack.html
@@ -144,7 +144,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L223">memory.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L223">memory.ts:223</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -172,7 +172,7 @@
 					<div class="tsd-signature tsd-kind-icon">temp<wbr>Args<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><a href="../interfaces/disposable.html" class="tsd-signature-type">Disposable</a><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L208">memory.ts:208</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L208">memory.ts:208</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -194,7 +194,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L312">memory.ts:312</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L312">memory.ts:312</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L284">memory.ts:284</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L284">memory.ts:284</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -262,7 +262,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L388">memory.ts:388</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L388">memory.ts:388</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -300,7 +300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L376">memory.ts:376</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L376">memory.ts:376</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -340,7 +340,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L267">memory.ts:267</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L267">memory.ts:267</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -373,7 +373,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L243">memory.ts:243</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L243">memory.ts:243</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -390,7 +390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L321">memory.ts:321</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L321">memory.ts:321</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -422,7 +422,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L252">memory.ts:252</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L252">memory.ts:252</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -444,7 +444,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L359">memory.ts:359</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L359">memory.ts:359</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -470,7 +470,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L342">memory.ts:342</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L342">memory.ts:342</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -496,7 +496,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L350">memory.ts:350</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L350">memory.ts:350</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -522,7 +522,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L326">memory.ts:326</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L326">memory.ts:326</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -548,7 +548,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L363">memory.ts:363</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L363">memory.ts:363</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -574,7 +574,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L346">memory.ts:346</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L346">memory.ts:346</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -600,7 +600,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L334">memory.ts:334</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L334">memory.ts:334</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
diff --git a/docs/reference/api/typedoc/classes/dldatatype.html b/docs/reference/api/typedoc/classes/dldatatype.html
index 5b79c74c4..9a4718bdf 100644
--- a/docs/reference/api/typedoc/classes/dldatatype.html
+++ b/docs/reference/api/typedoc/classes/dldatatype.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L262">runtime.ts:262</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L262">runtime.ts:262</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">bits<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L260">runtime.ts:260</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L260">runtime.ts:260</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">code<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L258">runtime.ts:258</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L258">runtime.ts:258</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -177,7 +177,7 @@
 					<div class="tsd-signature tsd-kind-icon">lanes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L262">runtime.ts:262</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L262">runtime.ts:262</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -199,7 +199,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L279">runtime.ts:279</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L279">runtime.ts:279</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -216,7 +216,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L270">runtime.ts:270</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L270">runtime.ts:270</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/reference/api/typedoc/classes/dldevice.html b/docs/reference/api/typedoc/classes/dldevice.html
index 0814704f9..5febcb56d 100644
--- a/docs/reference/api/typedoc/classes/dldevice.html
+++ b/docs/reference/api/typedoc/classes/dldevice.html
@@ -118,7 +118,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L202">runtime.ts:202</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L202">runtime.ts:202</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L200">runtime.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L200">runtime.ts:200</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -161,7 +161,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L198">runtime.ts:198</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L198">runtime.ts:198</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -183,7 +183,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L223">runtime.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L223">runtime.ts:223</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -205,7 +205,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L230">runtime.ts:230</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L230">runtime.ts:230</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/reference/api/typedoc/classes/environment.html b/docs/reference/api/typedoc/classes/environment.html
index fbe703084..065b4c83e 100644
--- a/docs/reference/api/typedoc/classes/environment.html
+++ b/docs/reference/api/typedoc/classes/environment.html
@@ -125,7 +125,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L86">environment.ts:86</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L86">environment.ts:86</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -169,7 +169,7 @@
 					<aside class="tsd-sources">
 						<p>Implementation of <a href="../interfaces/libraryprovider.html">LibraryProvider</a>.<a href="../interfaces/libraryprovider.html#imports">imports</a></p>
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L70">environment.ts:70</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L70">environment.ts:70</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L69">environment.ts:69</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L69">environment.ts:69</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -210,7 +210,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">ctypes.FTVMWasmPackedCFunc</span><span class="tsd-signature-symbol"> | </span><span class="tsd-signature-type">undefined</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = [undefined,]</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L78">environment.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L78">environment.ts:78</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -228,7 +228,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<wbr>Free<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L84">environment.ts:84</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L84">environment.ts:84</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -250,7 +250,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L105">environment.ts:105</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L105">environment.ts:105</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/ffilibrary.html b/docs/reference/api/typedoc/classes/ffilibrary.html
index 567d3fef0..11c5c6187 100644
--- a/docs/reference/api/typedoc/classes/ffilibrary.html
+++ b/docs/reference/api/typedoc/classes/ffilibrary.html
@@ -131,7 +131,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L49">runtime.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L49">runtime.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L46">runtime.ts:46</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L46">runtime.ts:46</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L45">runtime.ts:45</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L45">runtime.ts:45</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L44">runtime.ts:44</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L44">runtime.ts:44</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">webGPUContext<span class="tsd-signature-symbol">:</span> <a href="webgpucontext.html" class="tsd-signature-type">WebGPUContext</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L47">runtime.ts:47</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L47">runtime.ts:47</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -203,7 +203,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L76">runtime.ts:76</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L76">runtime.ts:76</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L66">runtime.ts:66</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L66">runtime.ts:66</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -243,7 +243,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L84">runtime.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L84">runtime.ts:84</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <a href="cachedcallstack.html" class="tsd-signature-type">CachedCallStack</a></h4>
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L95">runtime.ts:95</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L95">runtime.ts:95</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -283,7 +283,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L72">runtime.ts:72</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L72">runtime.ts:72</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/reference/api/typedoc/classes/graphexecutor.html b/docs/reference/api/typedoc/classes/graphexecutor.html
index 380dfcb63..6ec16f992 100644
--- a/docs/reference/api/typedoc/classes/graphexecutor.html
+++ b/docs/reference/api/typedoc/classes/graphexecutor.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L583">runtime.ts:583</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L583">runtime.ts:583</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">module<span class="tsd-signature-symbol">:</span> <a href="module.html" class="tsd-signature-type">Module</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L579">runtime.ts:579</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L579">runtime.ts:579</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L654">runtime.ts:654</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L654">runtime.ts:654</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -224,7 +224,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L597">runtime.ts:597</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L597">runtime.ts:597</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -241,7 +241,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L631">runtime.ts:631</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L631">runtime.ts:631</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L644">runtime.ts:644</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L644">runtime.ts:644</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -310,7 +310,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L621">runtime.ts:621</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L621">runtime.ts:621</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -332,7 +332,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L609">runtime.ts:609</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L609">runtime.ts:609</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/instance.html b/docs/reference/api/typedoc/classes/instance.html
index 4fc6a293f..d79bd5988 100644
--- a/docs/reference/api/typedoc/classes/instance.html
+++ b/docs/reference/api/typedoc/classes/instance.html
@@ -139,7 +139,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L692">runtime.ts:692</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L692">runtime.ts:692</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -202,7 +202,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L684">runtime.ts:684</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L684">runtime.ts:684</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -212,7 +212,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L683">runtime.ts:683</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L683">runtime.ts:683</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -229,7 +229,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L932">runtime.ts:932</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L932">runtime.ts:932</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L994">runtime.ts:994</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L994">runtime.ts:994</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -303,7 +303,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L924">runtime.ts:924</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L924">runtime.ts:924</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -341,7 +341,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L732">runtime.ts:732</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L732">runtime.ts:732</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -358,7 +358,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L952">runtime.ts:952</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L952">runtime.ts:952</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -402,7 +402,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L816">runtime.ts:816</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L816">runtime.ts:816</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -434,7 +434,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L1033">runtime.ts:1033</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L1033">runtime.ts:1033</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -465,7 +465,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L846">runtime.ts:846</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L846">runtime.ts:846</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -497,7 +497,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L750">runtime.ts:750</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L750">runtime.ts:750</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -520,7 +520,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L1013">runtime.ts:1013</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L1013">runtime.ts:1013</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -568,7 +568,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L789">runtime.ts:789</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L789">runtime.ts:789</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -608,7 +608,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L914">runtime.ts:914</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L914">runtime.ts:914</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -646,7 +646,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L1134">runtime.ts:1134</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L1134">runtime.ts:1134</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -698,7 +698,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L740">runtime.ts:740</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L740">runtime.ts:740</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -722,7 +722,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L868">runtime.ts:868</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L868">runtime.ts:868</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -754,7 +754,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L857">runtime.ts:857</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L857">runtime.ts:857</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -786,7 +786,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L940">runtime.ts:940</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L940">runtime.ts:940</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/memory.html b/docs/reference/api/typedoc/classes/memory.html
index 558989c89..45e7f4fe1 100644
--- a/docs/reference/api/typedoc/classes/memory.html
+++ b/docs/reference/api/typedoc/classes/memory.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L40">memory.ts:40</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L40">memory.ts:40</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Memory</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L32">memory.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L32">memory.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span><span class="tsd-signature-symbol"> = true</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L33">memory.ts:33</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L33">memory.ts:33</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L154">memory.ts:154</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L154">memory.ts:154</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -210,7 +210,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L90">memory.ts:90</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L90">memory.ts:90</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -233,7 +233,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L97">memory.ts:97</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L97">memory.ts:97</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -256,7 +256,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L74">memory.ts:74</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L74">memory.ts:74</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L81">memory.ts:81</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L81">memory.ts:81</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -302,7 +302,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L104">memory.ts:104</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L104">memory.ts:104</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -325,7 +325,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L132">memory.ts:132</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L132">memory.ts:132</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -362,7 +362,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L145">memory.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L145">memory.ts:145</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -393,7 +393,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L60">memory.ts:60</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L60">memory.ts:60</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -416,7 +416,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L67">memory.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L67">memory.ts:67</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -439,7 +439,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L53">memory.ts:53</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L53">memory.ts:53</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -462,7 +462,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L114">memory.ts:114</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L114">memory.ts:114</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -485,7 +485,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L124">memory.ts:124</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L124">memory.ts:124</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -502,7 +502,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/memory.ts#L175">memory.ts:175</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/memory.ts#L175">memory.ts:175</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/module.html b/docs/reference/api/typedoc/classes/module.html
index 88376a0ea..ecc84f926 100644
--- a/docs/reference/api/typedoc/classes/module.html
+++ b/docs/reference/api/typedoc/classes/module.html
@@ -124,7 +124,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L504">runtime.ts:504</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L504">runtime.ts:504</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L502">runtime.ts:502</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L502">runtime.ts:502</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -187,7 +187,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L516">runtime.ts:516</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L516">runtime.ts:516</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -204,7 +204,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L530">runtime.ts:530</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L530">runtime.ts:530</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -236,7 +236,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L561">runtime.ts:561</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L561">runtime.ts:561</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/ndarray.html b/docs/reference/api/typedoc/classes/ndarray.html
index 1d7b70d50..1024b43c7 100644
--- a/docs/reference/api/typedoc/classes/ndarray.html
+++ b/docs/reference/api/typedoc/classes/ndarray.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L304">runtime.ts:304</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L304">runtime.ts:304</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -158,7 +158,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <a href="dldevice.html" class="tsd-signature-type">DLDevice</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L297">runtime.ts:297</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L297">runtime.ts:297</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -173,7 +173,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L293">runtime.ts:293</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L293">runtime.ts:293</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -188,7 +188,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L289">runtime.ts:289</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L289">runtime.ts:289</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -203,7 +203,7 @@
 					<div class="tsd-signature tsd-kind-icon">ndim<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L291">runtime.ts:291</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L291">runtime.ts:291</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -218,7 +218,7 @@
 					<div class="tsd-signature tsd-kind-icon">shape<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L295">runtime.ts:295</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L295">runtime.ts:295</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -240,7 +240,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L370">runtime.ts:370</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L370">runtime.ts:370</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -273,7 +273,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L414">runtime.ts:414</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L414">runtime.ts:414</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -305,7 +305,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L355">runtime.ts:355</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L355">runtime.ts:355</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -322,7 +322,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L474">runtime.ts:474</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L474">runtime.ts:474</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -346,7 +346,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L443">runtime.ts:443</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L443">runtime.ts:443</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/packedfunccell.html b/docs/reference/api/typedoc/classes/packedfunccell.html
index 847fbb333..f7483b949 100644
--- a/docs/reference/api/typedoc/classes/packedfunccell.html
+++ b/docs/reference/api/typedoc/classes/packedfunccell.html
@@ -122,7 +122,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L158">runtime.ts:158</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L158">runtime.ts:158</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L157">runtime.ts:157</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L157">runtime.ts:157</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -164,7 +164,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L165">runtime.ts:165</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L165">runtime.ts:165</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
diff --git a/docs/reference/api/typedoc/classes/rpcserver.html b/docs/reference/api/typedoc/classes/rpcserver.html
index 571e7d0a4..8f549ab1c 100644
--- a/docs/reference/api/typedoc/classes/rpcserver.html
+++ b/docs/reference/api/typedoc/classes/rpcserver.html
@@ -115,7 +115,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">get<wbr>Imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">unknown</span><span class="tsd-signat [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -201,7 +201,7 @@
 					<div class="tsd-signature tsd-kind-icon">key<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -211,7 +211,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -242,7 +242,7 @@
 					<div class="tsd-signature tsd-kind-icon">socket<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">WebSocket</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -252,7 +252,7 @@
 					<div class="tsd-signature tsd-kind-icon">state<span class="tsd-signature-symbol">:</span> <a href="../enums/rpcserverstate.html" class="tsd-signature-type">RPCServerState</a><span class="tsd-signature-symbol"> = RPCServerState.InitHeader</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -262,7 +262,7 @@
 					<div class="tsd-signature tsd-kind-icon">url<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/classes/scalar.html b/docs/reference/api/typedoc/classes/scalar.html
index 4242b08ba..2917bc747 100644
--- a/docs/reference/api/typedoc/classes/scalar.html
+++ b/docs/reference/api/typedoc/classes/scalar.html
@@ -112,7 +112,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -137,7 +137,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">value<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L143">runtime.ts:143</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L143">runtime.ts:143</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/webgpucontext.html b/docs/reference/api/typedoc/classes/webgpucontext.html
index f0ce3a9ab..b72676b89 100644
--- a/docs/reference/api/typedoc/classes/webgpucontext.html
+++ b/docs/reference/api/typedoc/classes/webgpucontext.html
@@ -120,7 +120,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -145,7 +145,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">GPUDevice</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -155,7 +155,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -172,7 +172,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -209,7 +209,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/enums/argtypecode.html b/docs/reference/api/typedoc/enums/argtypecode.html
index 219dfd72f..d64ec1032 100644
--- a/docs/reference/api/typedoc/enums/argtypecode.html
+++ b/docs/reference/api/typedoc/enums/argtypecode.html
@@ -106,7 +106,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLDevice<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 6</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -116,7 +116,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -126,7 +126,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -136,7 +136,7 @@
 					<div class="tsd-signature tsd-kind-icon">Null<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 12</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMDLTensor<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 7</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMModule<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 9</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMNDArray<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 13</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -196,7 +196,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObject<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -206,7 +206,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObjectRValue<wbr>Ref<wbr>Arg<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 14</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -216,7 +216,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMOpaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -226,7 +226,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMPacked<wbr>Func<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 10</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -236,7 +236,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 11</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -246,7 +246,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/aynccallbackcode.html b/docs/reference/api/typedoc/enums/aynccallbackcode.html
index 08e80e5ad..f7151a30f 100644
--- a/docs/reference/api/typedoc/enums/aynccallbackcode.html
+++ b/docs/reference/api/typedoc/enums/aynccallbackcode.html
@@ -93,7 +93,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Exception<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L676">runtime.ts:676</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L676">runtime.ts:676</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -103,7 +103,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L675">runtime.ts:675</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L675">runtime.ts:675</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/dldatatypecode.html b/docs/reference/api/typedoc/enums/dldatatypecode.html
index 1d7a9f93f..18574949e 100644
--- a/docs/reference/api/typedoc/enums/dldatatypecode.html
+++ b/docs/reference/api/typedoc/enums/dldatatypecode.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L242">runtime.ts:242</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L242">runtime.ts:242</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L240">runtime.ts:240</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L240">runtime.ts:240</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">Opaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L243">runtime.ts:243</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L243">runtime.ts:243</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -125,7 +125,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L241">runtime.ts:241</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L241">runtime.ts:241</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/rpcserverstate.html b/docs/reference/api/typedoc/enums/rpcserverstate.html
index 8e13e5a1f..d0619322a 100644
--- a/docs/reference/api/typedoc/enums/rpcserverstate.html
+++ b/docs/reference/api/typedoc/enums/rpcserverstate.html
@@ -90,7 +90,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<wbr>Key<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Server<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Body<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">Wait<wbr>For<wbr>Callback<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/sizeof.html b/docs/reference/api/typedoc/enums/sizeof.html
index 904793259..b91845ef0 100644
--- a/docs/reference/api/typedoc/enums/sizeof.html
+++ b/docs/reference/api/typedoc/enums/sizeof.html
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLDevice<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32 + I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">F32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">F64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">I32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -150,7 +150,7 @@
 					<div class="tsd-signature tsd-kind-icon">I64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -160,7 +160,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMValue<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">U16<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -180,7 +180,7 @@
 					<div class="tsd-signature tsd-kind-icon">U8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/index.html b/docs/reference/api/typedoc/index.html
index bc0710c3d..f75fba2e4 100644
--- a/docs/reference/api/typedoc/index.html
+++ b/docs/reference/api/typedoc/index.html
@@ -174,7 +174,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Alloc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>shape<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, ndim<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeCode<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeBits<span class="tsd [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>Bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">num [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -282,7 +282,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>To<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>from<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, to<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-sig [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -326,7 +326,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>ToBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</sp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -370,7 +370,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -406,7 +406,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMBackend<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number< [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -458,7 +458,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCFunc<wbr>Set<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ret<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -506,7 +506,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCb<wbr>Arg<wbr>ToReturn<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, code<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span c [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -545,7 +545,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Call<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-t [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -601,7 +601,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -637,7 +637,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Get<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span cla [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -676,7 +676,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>List<wbr>Global<wbr>Names<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>outSize<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, outArray<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -715,7 +715,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Register<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, f<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, override<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</spa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -758,7 +758,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMGet<wbr>Last<wbr>Error<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -788,7 +788,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -824,7 +824,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Get<wbr>Function<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, funcName<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, queryImports<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">numbe [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -872,7 +872,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Import<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, dep<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-si [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -912,7 +912,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMSynchronize<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>deviceType<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, deviceId<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signatur [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -954,7 +954,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Alloc<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>size<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -990,7 +990,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Free<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ptr<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
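Note: FTVMWasmAllocSpace and FTVMWasmFreeSpace above form an alloc/free pair over the WASM heap. A minimal TypeScript sketch of how such a pair is typically wrapped — the signatures come from the typedoc entries in this diff; `withScratch` is hypothetical, and the two entries are declared as standalone functions here for brevity although the runtime exposes them through a WASM exports object:

    type Pointer = number; // mirrors the Pointer alias (ctypes.ts:25) documented below

    declare function FTVMWasmAllocSpace(size: number): Pointer;
    declare function FTVMWasmFreeSpace(ptr: Pointer): void;

    // Hypothetical helper: allocate scratch space, run the body, and free the
    // space even if the body throws.
    function withScratch<T>(size: number, body: (ptr: Pointer) => T): T {
      const ptr = FTVMWasmAllocSpace(size);
      try {
        return body(ptr);
      } finally {
        FTVMWasmFreeSpace(ptr);
      }
    }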
@@ -1026,7 +1026,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Func<wbr>Create<wbr>FromCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resource<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1066,7 +1066,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>args<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1118,7 +1118,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<wbr>Finalizer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resourceHandle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1154,7 +1154,7 @@
 					<div class="tsd-signature tsd-kind-icon">GPUPointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1169,7 +1169,7 @@
 					<div class="tsd-signature tsd-kind-icon">Packed<wbr>Func<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">...</span>args<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol"> &amp; </span><a href="interfaces/disp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L36">runtime.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L36">runtime.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1184,7 +1184,7 @@
 					<div class="tsd-signature tsd-kind-icon">Pointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1199,7 +1199,7 @@
 					<div class="tsd-signature tsd-kind-icon">Ptr<wbr>Offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1217,7 +1217,7 @@
 					<div class="tsd-signature tsd-kind-icon">RPC_<wbr>MAGIC<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">1045105</span><span class="tsd-signature-symbol"> = 1045105</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
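Note: RPC_MAGIC above is a fixed handshake constant (1045105). A hypothetical illustration of how such a magic number is commonly used — the value is taken from the rpc_server.ts entry in this diff, but the `checkMagic` helper and the handshake framing are assumptions, not the library's API:

    const RPC_MAGIC = 1045105; // value from rpc_server.ts:36 above

    // Hypothetical check: both ends of an RPC channel exchange the magic
    // value first and abort if the peer does not echo the expected constant.
    function checkMagic(received: number): void {
      if (received !== RPC_MAGIC) {
        throw new Error(`RPC handshake failed: expected ${RPC_MAGIC}, got ${received}`);
      }
    }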
@@ -1239,7 +1239,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/support.ts#L25">support.ts:25</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/support.ts#L25">support.ts:25</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1271,7 +1271,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/support.ts#L39">support.ts:39</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/support.ts#L39">support.ts:39</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1300,7 +1300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/support.ts#L52">support.ts:52</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/support.ts#L52">support.ts:52</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1337,7 +1337,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/compact.ts#L38">compact.ts:38</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/compact.ts#L38">compact.ts:38</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1368,7 +1368,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1390,7 +1390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/environment.ts#L32">environment.ts:32</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/environment.ts#L32">environment.ts:32</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1421,7 +1421,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/compact.ts#L24">compact.ts:24</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/compact.ts#L24">compact.ts:24</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1443,7 +1443,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L1356">runtime.ts:1356</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L1356">runtime.ts:1356</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1508,7 +1508,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/support.ts#L62">support.ts:62</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/support.ts#L62">support.ts:62</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1530,7 +1530,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<wbr>Code<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L246">runtime.ts:246</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L246">runtime.ts:246</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1539,7 +1539,7 @@
 						<div class="tsd-signature tsd-kind-icon">0<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;int&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L247">runtime.ts:247</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L247">runtime.ts:247</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1549,7 +1549,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;uint&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L248">runtime.ts:248</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L248">runtime.ts:248</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1559,7 +1559,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;float&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L249">runtime.ts:249</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L249">runtime.ts:249</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1569,7 +1569,7 @@
 						<div class="tsd-signature tsd-kind-icon">3<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;handle&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L250">runtime.ts:250</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L250">runtime.ts:250</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1580,7 +1580,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Enum<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L175">runtime.ts:175</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L175">runtime.ts:175</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1589,7 +1589,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L176">runtime.ts:176</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L176">runtime.ts:176</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1599,7 +1599,7 @@
 						<div class="tsd-signature tsd-kind-icon">15<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;webgpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L180">runtime.ts:180</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L180">runtime.ts:180</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1609,7 +1609,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cuda&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L177">runtime.ts:177</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L177">runtime.ts:177</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1619,7 +1619,7 @@
 						<div class="tsd-signature tsd-kind-icon">4<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;opencl&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L178">runtime.ts:178</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L178">runtime.ts:178</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1629,7 +1629,7 @@
 						<div class="tsd-signature tsd-kind-icon">8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;metal&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L179">runtime.ts:179</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L179">runtime.ts:179</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1640,7 +1640,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Str<wbr>ToEnum<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L183">runtime.ts:183</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L183">runtime.ts:183</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1649,7 +1649,7 @@
 						<div class="tsd-signature tsd-kind-icon">cl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L186">runtime.ts:186</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L186">runtime.ts:186</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1659,7 +1659,7 @@
 						<div class="tsd-signature tsd-kind-icon">cpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 1</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L184">runtime.ts:184</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L184">runtime.ts:184</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1669,7 +1669,7 @@
 						<div class="tsd-signature tsd-kind-icon">cuda<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 2</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L185">runtime.ts:185</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L185">runtime.ts:185</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1679,7 +1679,7 @@
 						<div class="tsd-signature tsd-kind-icon">metal<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 8</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L189">runtime.ts:189</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L189">runtime.ts:189</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1689,7 +1689,7 @@
 						<div class="tsd-signature tsd-kind-icon">opencl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L187">runtime.ts:187</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L187">runtime.ts:187</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1699,7 +1699,7 @@
 						<div class="tsd-signature tsd-kind-icon">vulkan<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 7</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L188">runtime.ts:188</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L188">runtime.ts:188</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1709,7 +1709,7 @@
 						<div class="tsd-signature tsd-kind-icon">webgpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 15</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/runtime.ts#L190">runtime.ts:190</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/runtime.ts#L190">runtime.ts:190</a></li>
 							</ul>
 						</aside>
 					</section>
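Note: the three object literals in the hunks above (DLDataTypeCodeToStr, DeviceEnumToStr, DeviceStrToEnum) can be reconstructed directly from the values shown in this diff. A minimal TypeScript sketch — the table contents are copied from the typedoc entries; the `dtypeToStr` helper is hypothetical, added only to show how such lookup tables are typically consumed:

    // Values as documented in runtime.ts:246-250 above.
    const DLDataTypeCodeToStr: Record<number, string> = {
      0: "int",
      1: "uint",
      2: "float",
      3: "handle",
    };

    // Values as documented in runtime.ts:175-180 above.
    const DeviceEnumToStr: Record<number, string> = {
      1: "cpu",
      2: "cuda",
      4: "opencl",
      8: "metal",
      15: "webgpu",
    };

    // Values as documented in runtime.ts:183-190 above; note "cl" and
    // "opencl" both map to device enum 4.
    const DeviceStrToEnum: Record<string, number> = {
      cpu: 1,
      cuda: 2,
      cl: 4,
      opencl: 4,
      vulkan: 7,
      metal: 8,
      webgpu: 15,
    };

    // Hypothetical helper: render a dtype such as code=2, bits=32 as "float32".
    function dtypeToStr(code: number, bits: number): string {
      return `${DLDataTypeCodeToStr[code]}${bits}`;
    }

    console.log(dtypeToStr(2, 32));                      // "float32"
    console.log(DeviceEnumToStr[DeviceStrToEnum["cl"]]); // "opencl"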
diff --git a/docs/reference/api/typedoc/interfaces/disposable.html b/docs/reference/api/typedoc/interfaces/disposable.html
index f5bc32cc1..529964ae5 100644
--- a/docs/reference/api/typedoc/interfaces/disposable.html
+++ b/docs/reference/api/typedoc/interfaces/disposable.html
@@ -113,7 +113,7 @@
 					<div class="tsd-signature tsd-kind-icon">dispose<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/types.ts#L52">types.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/types.ts#L52">types.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/interfaces/functioninfo.html b/docs/reference/api/typedoc/interfaces/functioninfo.html
index 27b782215..0d31a9b3b 100644
--- a/docs/reference/api/typedoc/interfaces/functioninfo.html
+++ b/docs/reference/api/typedoc/interfaces/functioninfo.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">arg_<wbr>types<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">launch_<wbr>param_<wbr>tags<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">name<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/interfaces/libraryprovider.html b/docs/reference/api/typedoc/interfaces/libraryprovider.html
index 2a0c33784..2c71c2265 100644
--- a/docs/reference/api/typedoc/interfaces/libraryprovider.html
+++ b/docs/reference/api/typedoc/interfaces/libraryprovider.html
@@ -112,7 +112,7 @@
 					<div class="tsd-signature tsd-kind-icon">imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/types.ts#L34">types.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/types.ts#L34">types.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -127,7 +127,7 @@
 					<div class="tsd-signature tsd-kind-icon">start<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>inst<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">Instance</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/f238900e6/web/src/types.ts#L39">types.ts:39</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/8bfe3bbb3/web/src/types.ts#L39">types.ts:39</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
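Note: the three interface pages touched above expose small, stable shapes. A sketch reconstructed from the typedoc signatures in this diff — the `Instance` parameter type is left opaque, and the `OwnedBuffer` class is a hypothetical example of the Disposable contract, not code from the repository:

    interface Disposable {
      dispose: () => void; // types.ts:52
    }

    interface FunctionInfo {
      name: string;                     // webgpu.ts:40
      arg_types: Array<string>;         // webgpu.ts:41
      launch_param_tags: Array<string>; // webgpu.ts:42
    }

    interface LibraryProvider {
      imports: Record<string, any>;     // types.ts:34
      start: (inst: unknown) => void;   // types.ts:39; `Instance` elided here
    }

    // Hypothetical Disposable implementation: frees a WASM-side pointer
    // exactly once, no matter how many times dispose() is called.
    class OwnedBuffer implements Disposable {
      constructor(private ptr: number, private free: (p: number) => void) {}
      dispose(): void {
        if (this.ptr !== 0) {
          this.free(this.ptr);
          this.ptr = 0;
        }
      }
    }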
diff --git a/docs/searchindex.js b/docs/searchindex.js
index cce0cb30c..196f9e46b 100644
--- a/docs/searchindex.js
+++ b/docs/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["arch/benchmark","arch/convert_layout","arch/debugger","arch/device_target_interactions","arch/frontend/tensorflow","arch/hybrid_script","arch/index","arch/inferbound","arch/introduction_to_module_serialization","arch/microtvm_design","arch/microtvm_project_api","arch/model_library_format","arch/pass_infra","arch/relay_intro","arch/relay_op_strategy","arch/runtime","arch/runtimes/vulkan","arch/security","arch/virtual_machine","contribute/ci","contribute/code_gu [...]
\ No newline at end of file
+Search.setIndex({docnames:["arch/benchmark","arch/convert_layout","arch/debugger","arch/device_target_interactions","arch/frontend/tensorflow","arch/hybrid_script","arch/index","arch/inferbound","arch/introduction_to_module_serialization","arch/microtvm_design","arch/microtvm_project_api","arch/model_library_format","arch/pass_infra","arch/relay_intro","arch/relay_op_strategy","arch/runtime","arch/runtimes/vulkan","arch/security","arch/virtual_machine","contribute/ci","contribute/code_gu [...]
\ No newline at end of file
diff --git a/docs/topic/vta/tutorials/autotvm/sg_execution_times.html b/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
index 298707899..a7d2170fc 100644
--- a/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
@@ -300,10 +300,10 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:21.331</strong> total execution time for <strong>topic_vta_tutorials_autotvm</strong> files:</p>
+<p><strong>00:20.124</strong> total execution time for <strong>topic_vta_tutorials_autotvm</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:21.126</strong>: <a class="reference internal" href="tune_relay_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-relay-vta-py"><span class="std std-ref">Auto-tuning a convolutional network on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_vta.py</span></code>)</p></li>
-<li><p><strong>00:00.205</strong>: <a class="reference internal" href="tune_alu_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-alu-vta-py"><span class="std std-ref">Auto-tuning an ALU fused op on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_alu_vta.py</span></code>)</p></li>
+<li><p><strong>00:19.933</strong>: <a class="reference internal" href="tune_relay_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-relay-vta-py"><span class="std std-ref">Auto-tuning a convolutional network on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_vta.py</span></code>)</p></li>
+<li><p><strong>00:00.191</strong>: <a class="reference internal" href="tune_alu_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-alu-vta-py"><span class="std std-ref">Auto-tuning an ALU fused op on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_alu_vta.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/topic/vta/tutorials/frontend/deploy_classification.html b/docs/topic/vta/tutorials/frontend/deploy_classification.html
index 3d4073a25..5e71130e7 100644
--- a/docs/topic/vta/tutorials/frontend/deploy_classification.html
+++ b/docs/topic/vta/tutorials/frontend/deploy_classification.html
@@ -539,7 +539,7 @@ and dense layer which will both be executed in fp32 on the CPU.</p></li>
   DeprecationWarning,
 /workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the  new recommended usage.
   relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
-resnet18_v1 inference graph built in 22.20s!
+resnet18_v1 inference graph built in 21.12s!
 </pre></div>
 </div>
 </div>
diff --git a/docs/topic/vta/tutorials/frontend/deploy_detection.html b/docs/topic/vta/tutorials/frontend/deploy_detection.html
index 03afd4e62..f3316a8f0 100644
--- a/docs/topic/vta/tutorials/frontend/deploy_detection.html
+++ b/docs/topic/vta/tutorials/frontend/deploy_detection.html
@@ -557,7 +557,7 @@ and dense layer which will both be executed in fp32 on the CPU.</p></li>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/relay/build_module.py:439: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
   DeprecationWarning,
-yolov3-tiny inference graph built in 15.38s!
+yolov3-tiny inference graph built in 14.80s!
 </pre></div>
 </div>
 </div>
diff --git a/docs/topic/vta/tutorials/frontend/sg_execution_times.html b/docs/topic/vta/tutorials/frontend/sg_execution_times.html
index fea548dfc..05c9d5e2f 100644
--- a/docs/topic/vta/tutorials/frontend/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/frontend/sg_execution_times.html
@@ -300,10 +300,10 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-frontend-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>01:30.190</strong> total execution time for <strong>topic_vta_tutorials_frontend</strong> files:</p>
+<p><strong>01:27.702</strong> total execution time for <strong>topic_vta_tutorials_frontend</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:47.943</strong>: <a class="reference internal" href="deploy_detection.html#sphx-glr-topic-vta-tutorials-frontend-deploy-detection-py"><span class="std std-ref">Deploy Pretrained Vision Detection Model from Darknet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_detection.py</span></code>)</p></li>
-<li><p><strong>00:42.248</strong>: <a class="reference internal" href="deploy_classification.html#sphx-glr-topic-vta-tutorials-frontend-deploy-classification-py"><span class="std std-ref">Deploy Pretrained Vision Model from MxNet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_classification.py</span></code>)</p></li>
+<li><p><strong>00:46.683</strong>: <a class="reference internal" href="deploy_detection.html#sphx-glr-topic-vta-tutorials-frontend-deploy-detection-py"><span class="std std-ref">Deploy Pretrained Vision Detection Model from Darknet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_detection.py</span></code>)</p></li>
+<li><p><strong>00:41.019</strong>: <a class="reference internal" href="deploy_classification.html#sphx-glr-topic-vta-tutorials-frontend-deploy-classification-py"><span class="std std-ref">Deploy Pretrained Vision Model from MxNet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_classification.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/topic/vta/tutorials/optimize/sg_execution_times.html b/docs/topic/vta/tutorials/optimize/sg_execution_times.html
index c6adb8ffd..14fe24a48 100644
--- a/docs/topic/vta/tutorials/optimize/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/optimize/sg_execution_times.html
@@ -300,10 +300,10 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-optimize-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:03.503</strong> total execution time for <strong>topic_vta_tutorials_optimize</strong> files:</p>
+<p><strong>00:03.486</strong> total execution time for <strong>topic_vta_tutorials_optimize</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:02.968</strong>: <a class="reference internal" href="convolution_opt.html#sphx-glr-topic-vta-tutorials-optimize-convolution-opt-py"><span class="std std-ref">2D Convolution Optimization</span></a> (<code class="docutils literal notranslate"><span class="pre">convolution_opt.py</span></code>)</p></li>
-<li><p><strong>00:00.535</strong>: <a class="reference internal" href="matrix_multiply_opt.html#sphx-glr-topic-vta-tutorials-optimize-matrix-multiply-opt-py"><span class="std std-ref">Matrix Multiply Blocking</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply_opt.py</span></code>)</p></li>
+<li><p><strong>00:02.967</strong>: <a class="reference internal" href="convolution_opt.html#sphx-glr-topic-vta-tutorials-optimize-convolution-opt-py"><span class="std std-ref">2D Convolution Optimization</span></a> (<code class="docutils literal notranslate"><span class="pre">convolution_opt.py</span></code>)</p></li>
+<li><p><strong>00:00.518</strong>: <a class="reference internal" href="matrix_multiply_opt.html#sphx-glr-topic-vta-tutorials-optimize-matrix-multiply-opt-py"><span class="std std-ref">Matrix Multiply Blocking</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply_opt.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/topic/vta/tutorials/sg_execution_times.html b/docs/topic/vta/tutorials/sg_execution_times.html
index 497f5b991..b88d8932c 100644
--- a/docs/topic/vta/tutorials/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/sg_execution_times.html
@@ -300,10 +300,10 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:00.964</strong> total execution time for <strong>topic_vta_tutorials</strong> files:</p>
+<p><strong>00:00.925</strong> total execution time for <strong>topic_vta_tutorials</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:00.490</strong>: <a class="reference internal" href="matrix_multiply.html#sphx-glr-topic-vta-tutorials-matrix-multiply-py"><span class="std std-ref">Simple Matrix Multiply</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply.py</span></code>)</p></li>
-<li><p><strong>00:00.474</strong>: <a class="reference internal" href="vta_get_started.html#sphx-glr-topic-vta-tutorials-vta-get-started-py"><span class="std std-ref">Get Started with VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">vta_get_started.py</span></code>)</p></li>
+<li><p><strong>00:00.463</strong>: <a class="reference internal" href="matrix_multiply.html#sphx-glr-topic-vta-tutorials-matrix-multiply-py"><span class="std std-ref">Simple Matrix Multiply</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply.py</span></code>)</p></li>
+<li><p><strong>00:00.462</strong>: <a class="reference internal" href="vta_get_started.html#sphx-glr-topic-vta-tutorials-vta-get-started-py"><span class="std std-ref">Get Started with VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">vta_get_started.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorial/auto_scheduler_matmul_x86.html b/docs/tutorial/auto_scheduler_matmul_x86.html
index b95c0653b..58da5ce70 100644
--- a/docs/tutorial/auto_scheduler_matmul_x86.html
+++ b/docs/tutorial/auto_scheduler_matmul_x86.html
@@ -544,7 +544,7 @@ operator fusion.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 94.001 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 93.880 ms
 </pre></div>
 </div>
 </div>
@@ -620,6 +620,7 @@ automatically optimize a matrix multiplication, without the need to specify a
 search template.  It ends a series of examples that starts from the Tensor
 Expression (TE) language that demonstrates how TVM can optimize computational
 operations.</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  0.100 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-auto-scheduler-matmul-x86-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">auto_scheduler_matmul_x86.py</span></code></a></p>
diff --git a/docs/tutorial/autotvm_relay_x86.html b/docs/tutorial/autotvm_relay_x86.html
index 073fd8997..2273e1b4f 100644
--- a/docs/tutorial/autotvm_relay_x86.html
+++ b/docs/tutorial/autotvm_relay_x86.html
@@ -513,7 +513,7 @@ standard deviation.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;mean&#39;: 497.8195587100008, &#39;median&#39;: 497.8889928000001, &#39;std&#39;: 1.2998905415882913}
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;mean&#39;: 490.83991727999893, &#39;median&#39;: 491.0158592500011, &#39;std&#39;: 0.36641131670863597}
 </pre></div>
 </div>
 </div>
@@ -667,129 +667,129 @@ depending on the specifics of the model and the target platform.</p>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  1/25]  Current/Best:   16.09/  23.78 GFLOPS | Progress: (4/10) | 5.97 s
-[Task  1/25]  Current/Best:   12.38/  23.78 GFLOPS | Progress: (8/10) | 9.09 s
-[Task  1/25]  Current/Best:   12.87/  23.78 GFLOPS | Progress: (10/10) | 10.51 s Done.
+[Task  1/25]  Current/Best:   17.07/  19.49 GFLOPS | Progress: (4/10) | 4.72 s
+[Task  1/25]  Current/Best:   24.09/  24.09 GFLOPS | Progress: (8/10) | 8.47 s
+[Task  1/25]  Current/Best:   12.84/  24.09 GFLOPS | Progress: (10/10) | 9.35 s Done.
 
 [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  2/25]  Current/Best:    6.93/  19.90 GFLOPS | Progress: (4/10) | 2.34 s
-[Task  2/25]  Current/Best:   12.59/  19.90 GFLOPS | Progress: (8/10) | 5.20 s
-[Task  2/25]  Current/Best:   13.76/  19.90 GFLOPS | Progress: (10/10) | 5.93 s Done.
+[Task  2/25]  Current/Best:    7.20/  22.41 GFLOPS | Progress: (4/10) | 2.29 s
+[Task  2/25]  Current/Best:   18.39/  22.41 GFLOPS | Progress: (8/10) | 3.46 s
+[Task  2/25]  Current/Best:    9.43/  22.41 GFLOPS | Progress: (10/10) | 4.18 s Done.
 
 [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  3/25]  Current/Best:   18.47/  20.88 GFLOPS | Progress: (4/10) | 4.04 s
-[Task  3/25]  Current/Best:   12.51/  23.11 GFLOPS | Progress: (8/10) | 5.68 s
-[Task  3/25]  Current/Best:   12.42/  23.11 GFLOPS | Progress: (10/10) | 6.59 s Done.
+[Task  3/25]  Current/Best:   12.44/  21.25 GFLOPS | Progress: (4/10) | 2.89 s
+[Task  3/25]  Current/Best:   17.48/  21.25 GFLOPS | Progress: (8/10) | 4.53 s
+[Task  3/25]  Current/Best:   18.40/  21.25 GFLOPS | Progress: (10/10) | 5.96 s Done.
 
 [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  4/25]  Current/Best:   15.14/  17.38 GFLOPS | Progress: (4/10) | 2.40 s
-[Task  4/25]  Current/Best:   12.29/  17.38 GFLOPS | Progress: (8/10) | 4.69 s
-[Task  4/25]  Current/Best:   17.37/  17.38 GFLOPS | Progress: (10/10) | 5.50 s Done.
+[Task  4/25]  Current/Best:   12.30/  14.07 GFLOPS | Progress: (4/10) | 6.93 s
+[Task  4/25]  Current/Best:    9.65/  20.11 GFLOPS | Progress: (8/10) | 8.55 s
+[Task  4/25]  Current/Best:   14.18/  20.11 GFLOPS | Progress: (10/10) | 11.60 s Done.
 
 [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  5/25]  Current/Best:   20.51/  20.51 GFLOPS | Progress: (4/10) | 3.74 s
-[Task  5/25]  Current/Best:   18.37/  22.59 GFLOPS | Progress: (8/10) | 5.51 s
-[Task  5/25]  Current/Best:    1.71/  22.59 GFLOPS | Progress: (10/10) | 7.02 s Done.
+[Task  5/25]  Current/Best:    9.64/  16.71 GFLOPS | Progress: (4/10) | 3.10 s
+[Task  5/25]  Current/Best:   13.63/  22.67 GFLOPS | Progress: (8/10) | 5.05 s
+[Task  5/25]  Current/Best:    6.13/  22.67 GFLOPS | Progress: (10/10) | 5.87 s Done.
 
 [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  6/25]  Current/Best:   12.60/  12.60 GFLOPS | Progress: (4/10) | 4.01 s
-[Task  6/25]  Current/Best:   14.84/  18.65 GFLOPS | Progress: (8/10) | 6.45 s
-[Task  6/25]  Current/Best:   13.74/  18.65 GFLOPS | Progress: (10/10) | 8.08 s Done.
+[Task  6/25]  Current/Best:    9.21/  14.94 GFLOPS | Progress: (4/10) | 3.76 s
+[Task  6/25]  Current/Best:   19.55/  19.55 GFLOPS | Progress: (8/10) | 5.94 s
+[Task  6/25]  Current/Best:    5.31/  19.55 GFLOPS | Progress: (10/10) | 7.04 s Done.
 
 [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  7/25]  Current/Best:   15.90/  16.23 GFLOPS | Progress: (4/10) | 3.29 s
-[Task  7/25]  Current/Best:   17.09/  17.09 GFLOPS | Progress: (8/10) | 5.05 s
-[Task  7/25]  Current/Best:   15.77/  17.09 GFLOPS | Progress: (10/10) | 6.06 s Done.
+[Task  7/25]  Current/Best:   15.88/  22.39 GFLOPS | Progress: (4/10) | 2.68 s
+[Task  7/25]  Current/Best:   15.74/  22.39 GFLOPS | Progress: (8/10) | 4.66 s
+[Task  7/25]  Current/Best:    1.59/  22.39 GFLOPS | Progress: (10/10) | 7.24 s Done.
 
 [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  8/25]  Current/Best:   22.54/  22.54 GFLOPS | Progress: (4/10) | 7.65 s
-[Task  8/25]  Current/Best:    3.82/  22.54 GFLOPS | Progress: (8/10) | 11.81 s
-[Task  8/25]  Current/Best:   10.26/  22.54 GFLOPS | Progress: (10/10) | 17.23 s Done.
-
+[Task  8/25]  Current/Best:   15.11/  18.82 GFLOPS | Progress: (4/10) | 2.93 s
+[Task  8/25]  Current/Best:   17.09/  18.82 GFLOPS | Progress: (8/10) | 14.28 s
+[Task  8/25]  Current/Best:   15.52/  18.82 GFLOPS | Progress: (10/10) | 15.04 s
 [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task  9/25]  Current/Best:   18.40/  18.40 GFLOPS | Progress: (4/10) | 2.35 s
-[Task  9/25]  Current/Best:   21.66/  21.66 GFLOPS | Progress: (8/10) | 4.34 s
-[Task  9/25]  Current/Best:   17.37/  21.66 GFLOPS | Progress: (10/10) | 5.07 s Done.
+[Task  9/25]  Current/Best:    9.87/  16.51 GFLOPS | Progress: (4/10) | 3.35 s
+[Task  9/25]  Current/Best:   21.53/  21.53 GFLOPS | Progress: (8/10) | 7.51 s
+[Task  9/25]  Current/Best:   12.17/  21.53 GFLOPS | Progress: (10/10) | 8.83 s Done.
 
 [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 10/25]  Current/Best:    5.55/  12.46 GFLOPS | Progress: (4/10) | 3.41 s
-[Task 10/25]  Current/Best:   17.16/  21.07 GFLOPS | Progress: (8/10) | 5.97 s
-[Task 10/25]  Current/Best:   18.13/  21.07 GFLOPS | Progress: (10/10) | 6.62 s Done.
+[Task 10/25]  Current/Best:   18.70/  18.70 GFLOPS | Progress: (4/10) | 3.24 s
+[Task 10/25]  Current/Best:   16.25/  18.70 GFLOPS | Progress: (8/10) | 4.92 s
+[Task 10/25]  Current/Best:   14.80/  18.70 GFLOPS | Progress: (10/10) | 5.64 s Done.
 
 [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 11/25]  Current/Best:   24.34/  24.34 GFLOPS | Progress: (4/10) | 2.87 s
-[Task 11/25]  Current/Best:   12.13/  24.34 GFLOPS | Progress: (8/10) | 5.40 s
-[Task 11/25]  Current/Best:   14.89/  24.34 GFLOPS | Progress: (10/10) | 6.54 s Done.
+[Task 11/25]  Current/Best:    8.03/  19.17 GFLOPS | Progress: (4/10) | 3.29 s
+[Task 11/25]  Current/Best:    7.67/  19.17 GFLOPS | Progress: (8/10) | 5.87 s
+[Task 11/25]  Current/Best:    9.02/  20.35 GFLOPS | Progress: (10/10) | 6.80 s Done.
 
 [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 12/25]  Current/Best:   15.22/  21.11 GFLOPS | Progress: (4/10) | 2.89 s
-[Task 12/25]  Current/Best:    4.69/  21.11 GFLOPS | Progress: (8/10) | 4.87 s
-[Task 12/25]  Current/Best:   15.17/  21.11 GFLOPS | Progress: (10/10) | 5.81 s Done.
+[Task 12/25]  Current/Best:    7.71/  18.45 GFLOPS | Progress: (4/10) | 4.15 s
+[Task 12/25]  Current/Best:   16.12/  18.45 GFLOPS | Progress: (8/10) | 7.24 s
+[Task 12/25]  Current/Best:   17.12/  18.45 GFLOPS | Progress: (10/10) | 8.03 s Done.
 
 [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 13/25]  Current/Best:   19.04/  20.57 GFLOPS | Progress: (4/10) | 3.67 s
-[Task 13/25]  Current/Best:   18.20/  21.00 GFLOPS | Progress: (8/10) | 7.77 s
-[Task 13/25]  Current/Best:   17.24/  21.00 GFLOPS | Progress: (10/10) | 8.66 s Done.
+[Task 13/25]  Current/Best:    9.17/  20.78 GFLOPS | Progress: (4/10) | 3.62 s
+[Task 13/25]  Current/Best:    8.79/  20.78 GFLOPS | Progress: (8/10) | 6.17 s
+[Task 13/25]  Current/Best:    7.50/  20.78 GFLOPS | Progress: (10/10) | 7.95 s Done.
 
 [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 14/25]  Current/Best:   18.08/  18.08 GFLOPS | Progress: (4/10) | 2.89 s
-[Task 14/25]  Current/Best:   13.93/  18.08 GFLOPS | Progress: (8/10) | 5.38 s
-[Task 14/25]  Current/Best:   11.46/  18.08 GFLOPS | Progress: (10/10) | 6.40 s Done.
+[Task 14/25]  Current/Best:    3.10/  18.74 GFLOPS | Progress: (4/10) | 3.60 s
+[Task 14/25]  Current/Best:   13.76/  18.74 GFLOPS | Progress: (8/10) | 6.71 s
+[Task 14/25]  Current/Best:   10.73/  18.76 GFLOPS | Progress: (10/10) | 8.62 s Done.
 
 [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 15/25]  Current/Best:    9.73/  18.08 GFLOPS | Progress: (4/10) | 3.34 s
-[Task 15/25]  Current/Best:   15.40/  19.10 GFLOPS | Progress: (8/10) | 7.58 s
-[Task 15/25]  Current/Best:    9.29/  19.10 GFLOPS | Progress: (10/10) | 10.29 s
+[Task 15/25]  Current/Best:   14.88/  16.19 GFLOPS | Progress: (4/10) | 2.45 s
+[Task 15/25]  Current/Best:    9.79/  21.69 GFLOPS | Progress: (8/10) | 6.08 s Done.
+
+[Task 15/25]  Current/Best:   14.80/  21.69 GFLOPS | Progress: (10/10) | 7.29 s Done.
+
 [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 16/25]  Current/Best:   10.71/  13.77 GFLOPS | Progress: (4/10) | 2.45 s
-[Task 16/25]  Current/Best:   21.27/  21.27 GFLOPS | Progress: (8/10) | 5.18 s
-[Task 16/25]  Current/Best:   10.29/  21.27 GFLOPS | Progress: (10/10) | 5.89 s Done.
+[Task 16/25]  Current/Best:   20.48/  20.48 GFLOPS | Progress: (4/10) | 2.18 s
+[Task 16/25]  Current/Best:   21.93/  21.93 GFLOPS | Progress: (8/10) | 4.88 s
+[Task 16/25]  Current/Best:   15.49/  22.39 GFLOPS | Progress: (10/10) | 5.41 s Done.
 
 [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 17/25]  Current/Best:    6.02/  13.89 GFLOPS | Progress: (4/10) | 3.72 s
-[Task 17/25]  Current/Best:    9.22/  21.85 GFLOPS | Progress: (8/10) | 5.84 s
-[Task 17/25]  Current/Best:   18.83/  21.85 GFLOPS | Progress: (10/10) | 6.80 s Done.
+[Task 17/25]  Current/Best:    6.17/  18.53 GFLOPS | Progress: (4/10) | 2.81 s
+[Task 17/25]  Current/Best:    7.49/  23.47 GFLOPS | Progress: (8/10) | 5.82 s
+[Task 17/25]  Current/Best:   17.05/  23.47 GFLOPS | Progress: (10/10) | 6.56 s Done.
 
 [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 18/25]  Current/Best:   11.29/  17.49 GFLOPS | Progress: (4/10) | 8.26 s
-[Task 18/25]  Current/Best:    9.38/  22.13 GFLOPS | Progress: (8/10) | 11.46 s
-[Task 18/25]  Current/Best:   10.35/  22.13 GFLOPS | Progress: (10/10) | 15.54 s Done.
+[Task 18/25]  Current/Best:   11.12/  18.88 GFLOPS | Progress: (4/10) | 4.12 s
+[Task 18/25]  Current/Best:   10.15/  18.88 GFLOPS | Progress: (8/10) | 8.68 s
+[Task 18/25]  Current/Best:   10.03/  18.88 GFLOPS | Progress: (10/10) | 10.33 s Done.
 
 [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 19/25]  Current/Best:   10.32/  22.57 GFLOPS | Progress: (4/10) | 4.06 s
-[Task 19/25]  Current/Best:   15.08/  23.47 GFLOPS | Progress: (8/10) | 7.38 s
-[Task 19/25]  Current/Best:   12.62/  23.47 GFLOPS | Progress: (10/10) | 9.03 s Done.
+[Task 19/25]  Current/Best:   10.99/  17.38 GFLOPS | Progress: (4/10) | 4.27 s
+[Task 19/25]  Current/Best:    1.56/  17.38 GFLOPS | Progress: (8/10) | 10.67 s
+[Task 19/25]  Current/Best:    9.06/  17.38 GFLOPS | Progress: (10/10) | 14.57 s Done.
 
 [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 20/25]  Current/Best:   12.73/  12.73 GFLOPS | Progress: (4/10) | 3.09 s Done.
-
-[Task 20/25]  Current/Best:   12.74/  12.74 GFLOPS | Progress: (8/10) | 5.70 s
-[Task 20/25]  Current/Best:   10.48/  19.17 GFLOPS | Progress: (10/10) | 7.26 s Done.
-
+[Task 20/25]  Current/Best:    7.32/  14.24 GFLOPS | Progress: (4/10) | 3.25 s
+[Task 20/25]  Current/Best:   10.22/  16.95 GFLOPS | Progress: (8/10) | 6.16 s
+[Task 20/25]  Current/Best:    2.46/  16.95 GFLOPS | Progress: (10/10) | 8.04 s
 [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 21/25]  Current/Best:   10.65/  16.39 GFLOPS | Progress: (4/10) | 2.39 s
-[Task 21/25]  Current/Best:   15.40/  18.35 GFLOPS | Progress: (8/10) | 6.83 s
-[Task 21/25]  Current/Best:    0.00/  18.35 GFLOPS | Progress: (10/10) | 7.18 s
+[Task 21/25]  Current/Best:   21.60/  21.60 GFLOPS | Progress: (4/10) | 2.61 s
+[Task 21/25]  Current/Best:   14.47/  21.60 GFLOPS | Progress: (8/10) | 4.45 s
+[Task 21/25]  Current/Best:    8.91/  21.60 GFLOPS | Progress: (10/10) | 5.19 s Done.
+
 [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 22/25]  Current/Best:   14.99/  16.78 GFLOPS | Progress: (4/10) | 2.97 s
-[Task 22/25]  Current/Best:    8.72/  20.07 GFLOPS | Progress: (8/10) | 4.90 s
-[Task 22/25]  Current/Best:    3.09/  20.07 GFLOPS | Progress: (10/10) | 5.87 s Done.
+[Task 22/25]  Current/Best:   17.82/  20.36 GFLOPS | Progress: (4/10) | 2.40 s
+[Task 22/25]  Current/Best:   10.62/  20.36 GFLOPS | Progress: (8/10) | 4.10 s
+[Task 22/25]  Current/Best:   12.31/  20.64 GFLOPS | Progress: (10/10) | 4.85 s Done.
 
 [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 23/25]  Current/Best:    1.55/  22.45 GFLOPS | Progress: (4/10) | 5.43 s
-[Task 23/25]  Current/Best:   19.03/  22.45 GFLOPS | Progress: (8/10) | 7.33 s
-[Task 23/25]  Current/Best:   13.38/  22.45 GFLOPS | Progress: (10/10) | 8.25 s Done.
+[Task 23/25]  Current/Best:    7.76/  17.21 GFLOPS | Progress: (4/10) | 4.00 s
+[Task 23/25]  Current/Best:    7.13/  20.32 GFLOPS | Progress: (8/10) | 7.85 s
+[Task 23/25]  Current/Best:   18.28/  20.32 GFLOPS | Progress: (10/10) | 8.75 s Done.
 
 [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
-[Task 24/25]  Current/Best:    4.33/   4.33 GFLOPS | Progress: (4/10) | 218.91 s
-[Task 24/25]  Current/Best:    5.08/   8.58 GFLOPS | Progress: (8/10) | 231.82 s
-[Task 24/25]  Current/Best:    3.85/   8.58 GFLOPS | Progress: (10/10) | 234.78 s
+[Task 24/25]  Current/Best:   10.59/  10.59 GFLOPS | Progress: (4/10) | 12.92 s
+[Task 24/25]  Current/Best:    3.49/  10.59 GFLOPS | Progress: (8/10) | 15.93 s
+[Task 24/25]  Current/Best:    0.55/  10.59 GFLOPS | Progress: (10/10) | 27.61 s
 [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
  Done.
 
-[Task 25/25]  Current/Best:    6.19/   6.19 GFLOPS | Progress: (4/10) | 18.08 s
-[Task 25/25]  Current/Best:    1.51/   9.27 GFLOPS | Progress: (8/10) | 32.75 s
-[Task 25/25]  Current/Best:    1.55/   9.27 GFLOPS | Progress: (10/10) | 52.61 s
+[Task 25/25]  Current/Best:    9.18/   9.18 GFLOPS | Progress: (4/10) | 2.59 s
+[Task 25/25]  Current/Best:    9.65/   9.65 GFLOPS | Progress: (8/10) | 21.21 s
+[Task 25/25]  Current/Best:    5.06/   9.65 GFLOPS | Progress: (10/10) | 22.69 s
 </pre></div>
 </div>
 <p>The output from this tuning process will look something like this:</p>
@@ -836,10 +836,6 @@ model using optimized operators to speed up our computations.</p>
 <span class="n">module</span> <span class="o">=</span> <a href="../reference/api/python/graph_executor.html#tvm.contrib.graph_executor.GraphModule" title="View documentation for tvm.contrib.graph_executor.GraphModule"><span class="n">graph_executor</span><span class="o">.</span><span class="n">GraphModule</span></a><span class="p">(</span><span class="n">lib</span><span class="p">[</span><span class="s2">&quot;default&quot;</span><span class="p">](</span><span class="n">dev</span><span c [...]
 </pre></div>
 </div>
-<p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Done.
-</pre></div>
-</div>
 <p>Verify that the optimized model runs and produces the same results:</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">dtype</span> <span class="o">=</span> <span class="s2">&quot;float32&quot;</span>
 <span class="n">module</span><span class="o">.</span><span class="n">set_input</span><span class="p">(</span><span class="n">input_name</span><span class="p">,</span> <span class="n">img_data</span><span class="p">)</span>
@@ -855,8 +851,8 @@ model using optimized operators to speed up our computations.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>class=&#39;n02123045 tabby, tabby cat&#39; with probability=0.621105
-class=&#39;n02123159 tiger cat&#39; with probability=0.356377
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>class=&#39;n02123045 tabby, tabby cat&#39; with probability=0.621104
+class=&#39;n02123159 tiger cat&#39; with probability=0.356379
 class=&#39;n02124075 Egyptian cat&#39; with probability=0.019712
 class=&#39;n02129604 tiger, Panthera tigris&#39; with probability=0.001215
 class=&#39;n04040759 radiator&#39; with probability=0.000262
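The class/probability lines above come from a softmax-and-rank step over the model output; a sketch, assuming `tvm_output` holds the logits and `labels` is a list loaded from the ImageNet synset file:

    import numpy as np
    from scipy.special import softmax

    # Convert logits to probabilities and print the top five classes.
    scores = softmax(tvm_output)
    for rank in np.argsort(scores[0])[::-1][:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[0][rank]))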
@@ -894,8 +890,8 @@ improvement in comparing the optimized model to the unoptimized model.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>optimized: {&#39;mean&#39;: 425.86808551000104, &#39;median&#39;: 425.5981838000025, &#39;std&#39;: 1.237847702770817}
-unoptimized: {&#39;mean&#39;: 497.8195587100008, &#39;median&#39;: 497.8889928000001, &#39;std&#39;: 1.2998905415882913}
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>optimized: {&#39;mean&#39;: 422.83022096000195, &#39;median&#39;: 422.861491499998, &#39;std&#39;: 0.6089890616715313}
+unoptimized: {&#39;mean&#39;: 490.83991727999893, &#39;median&#39;: 491.0158592500011, &#39;std&#39;: 0.36641131670863597}
 </pre></div>
 </div>
 </div>
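The mean/median/std figures above are gathered from repeated timed runs; a sketch of one common way to collect them with `timeit`, assuming `module` is the compiled GraphModule:

    import timeit
    import numpy as np

    # Time 10 batches of 10 runs each; report milliseconds per run.
    timing_number = 10
    timing_repeat = 10
    times = (
        np.array(
            timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number)
        )
        * 1000
        / timing_number
    )
    print({"mean": np.mean(times), "median": np.median(times), "std": np.std(times)})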
@@ -909,7 +905,7 @@ models.</p>
 <p>Here we presented a simple example using ResNet-50 v2 locally. However, TVM
 supports many more features, including cross-compilation, remote execution, and
 profiling/benchmarking.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 11 minutes  2.536 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 6 minutes  57.341 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-autotvm-relay-x86-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">autotvm_relay_x86.py</span></code></a></p>
diff --git a/docs/tutorial/cross_compilation_and_rpc.html b/docs/tutorial/cross_compilation_and_rpc.html
index 60d5a49c5..461f9bb0b 100644
--- a/docs/tutorial/cross_compilation_and_rpc.html
+++ b/docs/tutorial/cross_compilation_and_rpc.html
@@ -496,7 +496,7 @@ device and returns the measured cost. Network overhead is excluded.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>1.277e-07 secs/op
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>1.242e-07 secs/op
 </pre></div>
 </div>
 </div>
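A secs/op figure like the one above is typically produced by TVM's time evaluator; a minimal sketch, assuming a device handle `dev`, a module `func` already uploaded over RPC, and device arrays `a` and `b`:

    # Average the kernel over 10 runs on the remote device; the evaluator
    # excludes RPC and network overhead from the reported time.
    time_f = func.time_evaluator(func.entry_name, dev, number=10)
    print("%g secs/op" % time_f(a, b).mean)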
diff --git a/docs/tutorial/intro_topi.html b/docs/tutorial/intro_topi.html
index a024766b1..7c6b4cbf3 100644
--- a/docs/tutorial/intro_topi.html
+++ b/docs/tutorial/intro_topi.html
@@ -458,7 +458,7 @@ we can schedule the following series of operations ending with <code class="code
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[stage(a, placeholder(a, 0x1662a780)), stage(b, placeholder(b, 0x21590010)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[ [...]
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[stage(a, placeholder(a, 0xf2872e0)), stage(b, placeholder(b, 0xd08cf10)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[it [...]
 </pre></div>
 </div>
 <p>We can test the correctness by comparing with the <code class="code docutils literal notranslate"><span class="pre">numpy</span></code> result as follows</p>
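A sketch of such a cross-check for the broadcast add (T_add) in the stage list above, where `a` is (100, 10, 10) and `b` is (10, 10); `func` is assumed to be built from just that stage:

    import numpy as np
    import tvm

    dev = tvm.cpu(0)
    a_np = np.random.uniform(size=(100, 10, 10)).astype("float32")
    b_np = np.random.uniform(size=(10, 10)).astype("float32")
    a_nd = tvm.nd.array(a_np, dev)
    b_nd = tvm.nd.array(b_np, dev)
    out_nd = tvm.nd.array(np.zeros((100, 10, 10), dtype="float32"), dev)
    func(a_nd, b_nd, out_nd)
    # numpy broadcasts b over a's leading axis, mirroring T_add above.
    np.testing.assert_allclose(out_nd.numpy(), a_np + b_np, rtol=1e-5)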
diff --git a/docs/tutorial/sg_execution_times.html b/docs/tutorial/sg_execution_times.html
index 438718735..34db52a07 100644
--- a/docs/tutorial/sg_execution_times.html
+++ b/docs/tutorial/sg_execution_times.html
@@ -300,20 +300,20 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorial-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>13:31.365</strong> total execution time for <strong>tutorial</strong> files:</p>
+<p><strong>09:49.170</strong> total execution time for <strong>tutorial</strong> files:</p>
 <ul class="simple">
-<li><p><strong>11:02.536</strong>: <a class="reference internal" href="autotvm_relay_x86.html#sphx-glr-tutorial-autotvm-relay-x86-py"><span class="std std-ref">Compiling and Optimizing a Model with the Python Interface (AutoTVM)</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_relay_x86.py</span></code>)</p></li>
-<li><p><strong>01:01.157</strong>: <a class="reference internal" href="tensor_expr_get_started.html#sphx-glr-tutorial-tensor-expr-get-started-py"><span class="std std-ref">Working with Operators Using Tensor Expression</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_expr_get_started.py</span></code>)</p></li>
-<li><p><strong>00:42.059</strong>: <a class="reference internal" href="auto_scheduler_matmul_x86.html#sphx-glr-tutorial-auto-scheduler-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Auto-scheduling</span></a> (<code class="docutils literal notranslate"><span class="pre">auto_scheduler_matmul_x86.py</span></code>)</p></li>
-<li><p><strong>00:26.595</strong>: <a class="reference internal" href="relay_quick_start.html#sphx-glr-tutorial-relay-quick-start-py"><span class="std std-ref">Quick Start Tutorial for Compiling Deep Learning Models</span></a> (<code class="docutils literal notranslate"><span class="pre">relay_quick_start.py</span></code>)</p></li>
-<li><p><strong>00:16.821</strong>: <a class="reference internal" href="autotvm_matmul_x86.html#sphx-glr-tutorial-autotvm-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Schedule Templates and AutoTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_matmul_x86.py</span></code>)</p></li>
-<li><p><strong>00:01.050</strong>: <a class="reference internal" href="tensor_ir_blitz_course.html#sphx-glr-tutorial-tensor-ir-blitz-course-py"><span class="std std-ref">Blitz Course to TensorIR</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_ir_blitz_course.py</span></code>)</p></li>
-<li><p><strong>00:00.730</strong>: <a class="reference internal" href="intro_topi.html#sphx-glr-tutorial-intro-topi-py"><span class="std std-ref">Introduction to TOPI</span></a> (<code class="docutils literal notranslate"><span class="pre">intro_topi.py</span></code>)</p></li>
-<li><p><strong>00:00.229</strong>: <a class="reference internal" href="cross_compilation_and_rpc.html#sphx-glr-tutorial-cross-compilation-and-rpc-py"><span class="std std-ref">Cross Compilation and RPC</span></a> (<code class="docutils literal notranslate"><span class="pre">cross_compilation_and_rpc.py</span></code>)</p></li>
-<li><p><strong>00:00.050</strong>: <a class="reference internal" href="introduction.html#sphx-glr-tutorial-introduction-py"><span class="std std-ref">Introduction</span></a> (<code class="docutils literal notranslate"><span class="pre">introduction.py</span></code>)</p></li>
-<li><p><strong>00:00.049</strong>: <a class="reference internal" href="tvmc_python.html#sphx-glr-tutorial-tvmc-python-py"><span class="std std-ref">Getting Started using TVMC Python: a high-level API for TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_python.py</span></code>)</p></li>
-<li><p><strong>00:00.045</strong>: <a class="reference internal" href="tvmc_command_line_driver.html#sphx-glr-tutorial-tvmc-command-line-driver-py"><span class="std std-ref">Compiling and Optimizing a Model with TVMC</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_command_line_driver.py</span></code>)</p></li>
-<li><p><strong>00:00.043</strong>: <a class="reference internal" href="install.html#sphx-glr-tutorial-install-py"><span class="std std-ref">Installing TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">install.py</span></code>)</p></li>
+<li><p><strong>06:57.341</strong>: <a class="reference internal" href="autotvm_relay_x86.html#sphx-glr-tutorial-autotvm-relay-x86-py"><span class="std std-ref">Compiling and Optimizing a Model with the Python Interface (AutoTVM)</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_relay_x86.py</span></code>)</p></li>
+<li><p><strong>01:00.100</strong>: <a class="reference internal" href="auto_scheduler_matmul_x86.html#sphx-glr-tutorial-auto-scheduler-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Auto-scheduling</span></a> (<code class="docutils literal notranslate"><span class="pre">auto_scheduler_matmul_x86.py</span></code>)</p></li>
+<li><p><strong>00:59.103</strong>: <a class="reference internal" href="tensor_expr_get_started.html#sphx-glr-tutorial-tensor-expr-get-started-py"><span class="std std-ref">Working with Operators Using Tensor Expression</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_expr_get_started.py</span></code>)</p></li>
+<li><p><strong>00:25.831</strong>: <a class="reference internal" href="relay_quick_start.html#sphx-glr-tutorial-relay-quick-start-py"><span class="std std-ref">Quick Start Tutorial for Compiling Deep Learning Models</span></a> (<code class="docutils literal notranslate"><span class="pre">relay_quick_start.py</span></code>)</p></li>
+<li><p><strong>00:24.661</strong>: <a class="reference internal" href="autotvm_matmul_x86.html#sphx-glr-tutorial-autotvm-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Schedule Templates and AutoTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_matmul_x86.py</span></code>)</p></li>
+<li><p><strong>00:01.101</strong>: <a class="reference internal" href="tensor_ir_blitz_course.html#sphx-glr-tutorial-tensor-ir-blitz-course-py"><span class="std std-ref">Blitz Course to TensorIR</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_ir_blitz_course.py</span></code>)</p></li>
+<li><p><strong>00:00.704</strong>: <a class="reference internal" href="intro_topi.html#sphx-glr-tutorial-intro-topi-py"><span class="std std-ref">Introduction to TOPI</span></a> (<code class="docutils literal notranslate"><span class="pre">intro_topi.py</span></code>)</p></li>
+<li><p><strong>00:00.198</strong>: <a class="reference internal" href="cross_compilation_and_rpc.html#sphx-glr-tutorial-cross-compilation-and-rpc-py"><span class="std std-ref">Cross Compilation and RPC</span></a> (<code class="docutils literal notranslate"><span class="pre">cross_compilation_and_rpc.py</span></code>)</p></li>
+<li><p><strong>00:00.039</strong>: <a class="reference internal" href="introduction.html#sphx-glr-tutorial-introduction-py"><span class="std std-ref">Introduction</span></a> (<code class="docutils literal notranslate"><span class="pre">introduction.py</span></code>)</p></li>
+<li><p><strong>00:00.031</strong>: <a class="reference internal" href="tvmc_python.html#sphx-glr-tutorial-tvmc-python-py"><span class="std std-ref">Getting Started using TVMC Python: a high-level API for TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_python.py</span></code>)</p></li>
+<li><p><strong>00:00.031</strong>: <a class="reference internal" href="install.html#sphx-glr-tutorial-install-py"><span class="std std-ref">Installing TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">install.py</span></code>)</p></li>
+<li><p><strong>00:00.031</strong>: <a class="reference internal" href="tvmc_command_line_driver.html#sphx-glr-tutorial-tvmc-command-line-driver-py"><span class="std std-ref">Compiling and Optimizing a Model with TVMC</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_command_line_driver.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorial/tensor_expr_get_started.html b/docs/tutorial/tensor_expr_get_started.html
index a0952138f..7510a6380 100644
--- a/docs/tutorial/tensor_expr_get_started.html
+++ b/docs/tutorial/tensor_expr_get_started.html
@@ -507,7 +507,7 @@ helper function to run a profile of the TVM generated code.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.000007
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.000009
 naive: 0.000006
 </pre></div>
 </div>
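Timings like "Numpy running time" and "naive" above are printed by a small profiling helper; a sketch, assuming `tgt` is the build target and the vector-add tensors use float32:

    import numpy as np
    import tvm

    def evaluate_addition(func, tgt, optimization, log, n=32768):
        # Time the built vector-add over 10 runs and record the mean.
        dev = tvm.device(tgt.kind.name, 0)
        a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), dev)
        b = tvm.nd.array(np.random.uniform(size=n).astype("float32"), dev)
        c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
        mean_time = func.time_evaluator(func.entry_name, dev, number=10)(a, b, c).mean
        print("%s: %f" % (optimization, mean_time))
        log.append((optimization, mean_time))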
@@ -598,7 +598,7 @@ factor to be the number of threads on your CPU.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>vector: 0.000025
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>vector: 0.000027
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type=&quot;auto&quot;),
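The "vector" timing above corresponds to a split-then-vectorize schedule; a sketch, with the split `factor` as an assumed value (the surrounding text suggests the number of threads on your CPU):

    # Split the loop, run outer chunks in parallel, and vectorize the
    # fixed-size inner loop.
    factor = 4  # assumed; set to your CPU's thread count
    outer, inner = s[C].split(C.op.axis[0], factor=factor)
    s[C].parallel(outer)
    s[C].vectorize(inner)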
@@ -631,10 +631,10 @@ factor to be the number of threads on your CPU.</p>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Operator                  Timing             Performance
-   numpy    7.338310001614445e-06                    1.0
-   naive              5.9001e-06      0.8040134579626596
-parallel              6.0426e-06      0.8234320979449784
-  vector             2.46126e-05       3.353987497746098
+   numpy    9.022220000360903e-06                    1.0
+   naive    5.8499000000000005e-06    0.6483880907100464
+parallel              6.0838e-06      0.6743129739417392
+  vector    2.6628699999999998e-05    2.9514576233936665
 </pre></div>
 </div>
 <div class="admonition-code-specialization admonition">
@@ -952,7 +952,7 @@ matrix multiplication.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018640
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018395
 </pre></div>
 </div>
 <p>Now we write a basic matrix multiplication using TVM TE and verify that it
@@ -994,7 +994,7 @@ optimizations.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>none: 3.393963
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>none: 3.289817
 </pre></div>
 </div>
 <p>Let’s take a look at the intermediate representation of the operator and
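The "none" baseline above times the matmul before any scheduling; a minimal sketch of that computation, assuming M = K = N = 1024 and an "llvm" target:

    import tvm
    from tvm import te

    M = K = N = 1024
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
    s = te.create_schedule(C.op)
    func = tvm.build(s, [A, B, C], target="llvm", name="mmult")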
@@ -1060,7 +1060,7 @@ schedule.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>blocking: 0.326739
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>blocking: 0.294380
 </pre></div>
 </div>
 <p>By reordering the computation to take advantage of caching, you should see a
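The "blocking" time above comes from tiling the loops so each 32x32 block of C stays in cache; a sketch, reusing the names from the matmul sketch earlier:

    # Tile the two spatial loops and split the reduction axis.
    bn = 32
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (k,) = s[C].op.reduce_axis
    ko, ki = s[C].split(k, factor=4)
    s[C].reorder(mo, no, ko, ki, mi, ni)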
@@ -1120,7 +1120,7 @@ already cache friendly from our previous optimizations.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>vectorization: 0.342312
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>vectorization: 0.330446
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
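The "vectorization" variant above adds a single directive on top of the blocked schedule; a sketch:

    # Emit SIMD instructions for the innermost, unit-stride loop over C.
    s[C].vectorize(ni)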
@@ -1175,7 +1175,7 @@ more cache friendly.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>loop permutation: 0.121740
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>loop permutation: 0.114469
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
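The "loop permutation" improvement above comes from hoisting the inner spatial loop above the reduction split so writes to C are sequential; a sketch:

    # Reorder so ki sits between mi and the vectorized ni.
    s[C].reorder(mo, no, ko, mi, ki, ni)
    s[C].vectorize(ni)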
@@ -1251,7 +1251,7 @@ optimized schedule.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>array packing: 0.110851
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>array packing: 0.108735
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
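The "array packing" variant above re-lays-out B so each block's innermost loads are contiguous; a sketch, with `bn = 32` as before (the schedule is then rebuilt from the new C):

    # Pack B into [N // bn, K, bn] so inner loads of a block are contiguous.
    packedB = te.compute(
        (N // bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB"
    )
    C = te.compute(
        (M, N),
        lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
        name="C",
    )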
@@ -1325,7 +1325,7 @@ to `C</cite> when all the block results are ready.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>block caching: 0.111037
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>block caching: 0.110104
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
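The "block caching" variant above accumulates each block into a write cache and copies it to C once the block is complete; a sketch:

    # Allocate a write cache for C and compute it per output block.
    CC = s.cache_write(C, "global")
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[CC].compute_at(s[C], no)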
@@ -1392,7 +1392,7 @@ of thread-level parallelization.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>parallelization: 0.144317
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>parallelization: 0.144070
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
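The "parallelization" variant above threads the outermost block loop; a sketch:

    # Distribute the outermost row-block loop across CPU threads.
    s[C].parallel(mo)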
@@ -1454,13 +1454,13 @@ working, we can compare the results.</p>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>        Operator                  Timing             Performance
-            none      3.3939625678999996                     1.0
-        blocking     0.32673893090000006     0.09627063480024425
-   vectorization            0.3423118515     0.10085905328997309
-loop permutation     0.12173995230000001    0.035869562455229466
-   array packing     0.11085103339999999     0.03266124218588204
-   block caching     0.11103668380000001     0.03271594237667256
- parallelization              0.14431674     0.04252160626783088
+            none            3.2898165229                     1.0
+        blocking            0.2943796007     0.08948207252619123
+   vectorization     0.33044605889999995     0.10044513321633788
+loop permutation     0.11446943409999999    0.034795081519954876
+   array packing     0.10873529960000002    0.033052086292079584
+   block caching     0.11010433149999999     0.03346822861809391
+ parallelization            0.1440697348     0.04379263518106516
 </pre></div>
 </div>
 <p>Note that the outputs on the web page reflect the running times on a
@@ -1492,7 +1492,6 @@ is</p>
 you can build generic templates of the matrix multiplication and other
 operations with tunable parameters that allow you to automatically optimize
 the computation for specific platforms.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  1.157 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-tensor-expr-get-started-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/40a01cffb015a67aaec0fad7e27cf80d/tensor_expr_get_started.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tensor_expr_get_started.py</span></code></a></p>