Posted to commits@tvm.apache.org by tq...@apache.org on 2022/07/01 17:44:55 UTC

[tvm-site] branch asf-site updated: deploying docs (apache/tvm@0ae3f5d6ce6f150dd038f59429ab1da4fadea177)

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 87682a9a9 deploying docs (apache/tvm@0ae3f5d6ce6f150dd038f59429ab1da4fadea177)
87682a9a9 is described below

commit 87682a9a9ef47be33579ec5294f13126334c80fe
Author: tvm-bot <95...@users.noreply.github.com>
AuthorDate: Fri Jul 1 17:44:48 2022 +0000

    deploying docs (apache/tvm@0ae3f5d6ce6f150dd038f59429ab1da4fadea177)
---
 docs/_sources/contribute/ci.rst.txt                |  26 +-
 .../how_to/compile_models/from_darknet.rst.txt     |   5 -
 .../how_to/compile_models/from_mxnet.rst.txt       |   2 +-
 .../how_to/compile_models/from_oneflow.rst.txt     |   2 +-
 .../how_to/compile_models/from_pytorch.rst.txt     |   2 +-
 .../how_to/compile_models/from_tensorflow.rst.txt  |   5 -
 .../compile_models/sg_execution_times.rst.txt      |  22 +-
 .../deploy_models/deploy_model_on_android.rst.txt  |   2 +-
 .../deploy_object_detection_pytorch.rst.txt        |   4 +-
 .../deploy_models/deploy_prequantized.rst.txt      |   6 +-
 .../deploy_prequantized_tflite.rst.txt             |   4 +-
 .../how_to/deploy_models/deploy_quantized.rst.txt  |   2 +-
 .../deploy_models/deploy_ssd_gluoncv.rst.txt       |   4 +-
 .../deploy_models/sg_execution_times.rst.txt       |  16 +-
 .../extend_tvm/bring_your_own_datatypes.rst.txt    |   2 +-
 .../how_to/extend_tvm/sg_execution_times.rst.txt   |  10 +-
 .../how_to/extend_tvm/use_pass_instrument.rst.txt  |  16 +-
 .../optimize_operators/opt_conv_cuda.rst.txt       |   2 +-
 .../optimize_operators/opt_conv_tensorcore.rst.txt |   2 +-
 .../how_to/optimize_operators/opt_gemm.rst.txt     |  16 +-
 .../optimize_operators/sg_execution_times.rst.txt  |   8 +-
 .../sg_execution_times.rst.txt                     |  14 +-
 .../tune_conv2d_layer_cuda.rst.txt                 | 558 +++++++++++++++++----
 .../tune_network_cuda.rst.txt                      |   2 +-
 .../tune_network_x86.rst.txt                       |   4 +-
 .../tune_sparse_x86.rst.txt                        | 116 +----
 .../tune_with_autotvm/sg_execution_times.rst.txt   |   6 +-
 .../tune_with_autotvm/tune_conv2d_cuda.rst.txt     |  34 +-
 .../work_with_microtvm/micro_autotune.rst.txt      |  16 +-
 .../how_to/work_with_microtvm/micro_train.rst.txt  |  16 +-
 .../work_with_microtvm/sg_execution_times.rst.txt  |   8 +-
 .../work_with_relay/sg_execution_times.rst.txt     |   6 +-
 .../how_to/work_with_schedules/intrin_math.rst.txt |   2 +-
 .../work_with_schedules/sg_execution_times.rst.txt |  14 +-
 .../how_to/work_with_schedules/tensorize.rst.txt   |   2 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |   4 +-
 .../frontend/deploy_classification.rst.txt         |   2 +-
 .../tutorials/frontend/deploy_detection.rst.txt    |   2 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |   6 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |   6 +-
 .../topic/vta/tutorials/sg_execution_times.rst.txt |   6 +-
 .../tutorial/auto_scheduler_matmul_x86.rst.txt     |  14 +-
 docs/_sources/tutorial/autotvm_matmul_x86.rst.txt  |  20 +-
 docs/_sources/tutorial/autotvm_relay_x86.rst.txt   |  54 +-
 .../tutorial/cross_compilation_and_rpc.rst.txt     |   2 +-
 docs/_sources/tutorial/intro_topi.rst.txt          |   2 +-
 docs/_sources/tutorial/sg_execution_times.rst.txt  |  24 +-
 .../tutorial/tensor_expr_get_started.rst.txt       |  44 +-
 docs/commit_hash                                   |   2 +-
 docs/contribute/ci.html                            |  71 ++-
 docs/how_to/compile_models/from_darknet.html       |   1 -
 docs/how_to/compile_models/from_mxnet.html         |   2 +-
 docs/how_to/compile_models/from_oneflow.html       |  12 +-
 docs/how_to/compile_models/from_pytorch.html       |  10 +-
 docs/how_to/compile_models/from_tensorflow.html    |   1 -
 docs/how_to/compile_models/sg_execution_times.html |  32 +-
 .../deploy_models/deploy_model_on_android.html     |   2 +-
 .../deploy_object_detection_pytorch.html           |  45 +-
 docs/how_to/deploy_models/deploy_prequantized.html |  14 +-
 .../deploy_models/deploy_prequantized_tflite.html  |   4 +-
 docs/how_to/deploy_models/deploy_quantized.html    |   2 +-
 docs/how_to/deploy_models/deploy_ssd_gluoncv.html  |  38 +-
 docs/how_to/deploy_models/sg_execution_times.html  |  16 +-
 .../extend_tvm/bring_your_own_datatypes.html       |   2 +-
 docs/how_to/extend_tvm/sg_execution_times.html     |  10 +-
 docs/how_to/extend_tvm/use_pass_instrument.html    |  16 +-
 docs/how_to/optimize_operators/opt_conv_cuda.html  |   2 +-
 .../optimize_operators/opt_conv_tensorcore.html    |   2 +-
 docs/how_to/optimize_operators/opt_gemm.html       |  16 +-
 .../optimize_operators/sg_execution_times.html     |   8 +-
 .../sg_execution_times.html                        |  14 +-
 .../tune_conv2d_layer_cuda.html                    | 554 ++++++++++++++++----
 .../tune_with_autoscheduler/tune_network_cuda.html |   2 +-
 .../tune_with_autoscheduler/tune_network_x86.html  |   4 +-
 .../tune_with_autoscheduler/tune_sparse_x86.html   | 116 +----
 .../tune_with_autotvm/sg_execution_times.html      |   6 +-
 .../how_to/tune_with_autotvm/tune_conv2d_cuda.html |  34 +-
 docs/how_to/work_with_microtvm/micro_autotune.html |  16 +-
 docs/how_to/work_with_microtvm/micro_train.html    |  16 +-
 .../work_with_microtvm/sg_execution_times.html     |   8 +-
 .../how_to/work_with_relay/sg_execution_times.html |   6 +-
 docs/how_to/work_with_schedules/intrin_math.html   |   2 +-
 .../work_with_schedules/sg_execution_times.html    |  18 +-
 docs/how_to/work_with_schedules/tensorize.html     |   2 +-
 ...meta__schedule_1_1FeatureExtractor-members.html |   2 +-
 ...stvm_1_1meta__schedule_1_1FeatureExtractor.html |  19 +-
 .../api/doxygen/feature__extractor_8h_source.html  |   2 +-
 docs/reference/api/doxygen/functions_func_p.html   |   2 +-
 docs/reference/api/doxygen/functions_p.html        |   2 +-
 docs/reference/api/doxygen/search/all_11.js        |   2 +-
 docs/reference/api/doxygen/search/functions_10.js  |   2 +-
 docs/reference/api/python/auto_scheduler.html      |   4 +-
 docs/reference/api/python/topi.html                |   4 +-
 .../api/typedoc/classes/bytestreamreader.html      |  12 +-
 .../api/typedoc/classes/cachedcallstack.html       |  34 +-
 docs/reference/api/typedoc/classes/dldatatype.html |  12 +-
 docs/reference/api/typedoc/classes/dldevice.html   |  10 +-
 .../reference/api/typedoc/classes/environment.html |  12 +-
 docs/reference/api/typedoc/classes/ffilibrary.html |  20 +-
 .../api/typedoc/classes/graphexecutor.html         |  16 +-
 docs/reference/api/typedoc/classes/instance.html   |  40 +-
 docs/reference/api/typedoc/classes/memory.html     |  34 +-
 docs/reference/api/typedoc/classes/module.html     |  10 +-
 docs/reference/api/typedoc/classes/ndarray.html    |  22 +-
 .../api/typedoc/classes/packedfunccell.html        |   6 +-
 docs/reference/api/typedoc/classes/rpcserver.html  |  14 +-
 docs/reference/api/typedoc/classes/scalar.html     |   6 +-
 .../api/typedoc/classes/webgpucontext.html         |  12 +-
 docs/reference/api/typedoc/enums/argtypecode.html  |  30 +-
 .../api/typedoc/enums/aynccallbackcode.html        |   4 +-
 .../api/typedoc/enums/dldatatypecode.html          |   8 +-
 .../api/typedoc/enums/rpcserverstate.html          |  12 +-
 docs/reference/api/typedoc/enums/sizeof.html       |  18 +-
 docs/reference/api/typedoc/index.html              | 112 ++---
 .../api/typedoc/interfaces/disposable.html         |   2 +-
 .../api/typedoc/interfaces/functioninfo.html       |   6 +-
 .../api/typedoc/interfaces/libraryprovider.html    |   4 +-
 docs/searchindex.js                                |   2 +-
 .../vta/tutorials/autotvm/sg_execution_times.html  |   4 +-
 .../tutorials/frontend/deploy_classification.html  |   2 +-
 .../vta/tutorials/frontend/deploy_detection.html   |   2 +-
 .../vta/tutorials/frontend/sg_execution_times.html |   6 +-
 .../vta/tutorials/optimize/sg_execution_times.html |   6 +-
 docs/topic/vta/tutorials/sg_execution_times.html   |   6 +-
 docs/tutorial/auto_scheduler_matmul_x86.html       |   6 +-
 docs/tutorial/autotvm_matmul_x86.html              |  20 +-
 docs/tutorial/autotvm_relay_x86.html               | 256 +++++-----
 docs/tutorial/cross_compilation_and_rpc.html       |   2 +-
 docs/tutorial/intro_topi.html                      |   2 +-
 docs/tutorial/sg_execution_times.html              |  28 +-
 docs/tutorial/tensor_expr_get_started.html         |  44 +-
 131 files changed, 1857 insertions(+), 1272 deletions(-)

diff --git a/docs/_sources/contribute/ci.rst.txt b/docs/_sources/contribute/ci.rst.txt
index 9a2876220..a421103ab 100644
--- a/docs/_sources/contribute/ci.rst.txt
+++ b/docs/_sources/contribute/ci.rst.txt
@@ -31,7 +31,7 @@ Jenkins is the only CI step that is codified to block merging. TVM is also teste
 against Windows and MacOS using GitHub Actions.
 
 This page describes how contributors and committers can use TVM's CI to verify their code. You can
-read more about the design of TVM CI in the
+read more about the design of TVM CI in the `tlc-pack/ci <https://github.com/tlc-pack/ci>`_ repo.
 
 For Contributors
 ----------------
@@ -164,6 +164,30 @@ be re-triggered by another ``git push``).
    git push my_repo
 
 
+Docker Images
+^^^^^^^^^^^^^
+
+Each CI job runs most of its work inside a Docker container, built from files
+in the `docker/ <https://github.com/apache/tvm/tree/main/docker>`_ folder. These
+files are built nightly in Jenkins via the `docker-images-ci <https://ci.tlcpack.ai/job/docker-images-ci/>`_ job.
+The images for these containers are hosted in the `tlcpack Docker Hub <https://hub.docker.com/u/tlcpack>`_
+and referenced in the `Jenkinsfile.j2 <https://github.com/apache/tvm/tree/main/Jenkinsfile.j2>`_. These can be inspected and run
+locally via standard Docker commands.
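+
+For reference (not part of this diff), "inspected and run locally" here means an
+ordinary Docker pull-and-run; a minimal sketch, where the image name and tag are
+placeholders and the authoritative list lives in ``Jenkinsfile.j2``:
+
+.. code-block:: python
+
+    # Sketch: pull one of the nightly CI images and open a shell in it.
+    # "tlcpack/ci-lint" is a placeholder; pin the exact tag from Jenkinsfile.j2.
+    # The docker run step assumes an interactive terminal.
+    import subprocess
+
+    image = "tlcpack/ci-lint"
+    subprocess.run(["docker", "pull", image], check=True)
+    subprocess.run(["docker", "run", "--rm", "-it", image, "/bin/bash"], check=True)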
+
+
+``ci-docker-staging``
+^^^^^^^^^^^^^^^^^^^^^
+
+The `ci-docker-staging <https://github.com/apache/tvm/tree/ci-docker-staging>`_
+branch is typically used to test updates to Docker images and ``Jenkinsfile`` changes. When
+running a build for a normal PR from a forked repository, Jenkins uses the code
+from the PR except for the ``Jenkinsfile`` itself, which comes from the base branch.
+When branches are built, the ``Jenkinsfile`` in the branch is used, so a committer
+with write access must push PRs to a branch in apache/tvm to properly test
+``Jenkinsfile`` changes. If your PR makes changes to the ``Jenkinsfile``, make sure
+to @ a `committer <https://github.com/apache/tvm/tree/main/CONTRIBUTORS.md>`_
+and ask them to push your PR as a branch to test the changes.
+
 
 CI Monitoring Rotation
 ^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/_sources/how_to/compile_models/from_darknet.rst.txt b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
index 9feee013a..c05aaa119 100644
--- a/docs/_sources/how_to/compile_models/from_darknet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
@@ -315,11 +315,6 @@ The process is no different from other examples.
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  2.506 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_darknet.py:
 
 .. only:: html
diff --git a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
index 80d0d09ea..ce2600b5d 100644
--- a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
@@ -115,7 +115,7 @@ In this section, we download a pretrained imagenet model and classify an image.
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip8a204fab-2757-45a3-a387-97883374e7a8 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip1e8cea2c-47ab-462c-b3c7-021f6d7fcddc from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
     x (1, 3, 224, 224)
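
    For context, the download above is triggered by the tutorial's model-zoo call;
    a minimal sketch of that fetch-and-import step, following the from_mxnet
    tutorial and the standard Relay frontend API:

    .. code-block:: python

        # get_model() downloads the resnet18_v1 zip shown in the log above,
        # then the Gluon block is imported into Relay.
        from mxnet.gluon.model_zoo import vision
        from tvm import relay

        block = vision.get_model("resnet18_v1", pretrained=True)
        shape_dict = {"data": (1, 3, 224, 224)}  # matches "x (1, 3, 224, 224)"
        mod, params = relay.frontend.from_mxnet(block, shape_dict)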
 
 
diff --git a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
index adc019ae4..33e0a113f 100644
--- a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
@@ -113,7 +113,7 @@ Load a pretrained OneFlow model and save model
  .. code-block:: none
 
     Downloading: "https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip" to /workspace/.oneflow/flowvision_cache/resnet18.zip
-
      0%|          | 0.00/41.5M [00:00<?, ?B/s]
     19%|#9        | 7.99M/41.5M [00:00<00:00, 61.4MB/s]
     39%|###8      | 16.0M/41.5M [00:00<00:00, 58.7MB/s]
     58%|#####7    | 24.0M/41.5M [00:00<00:00, 53.9MB/s]
     76%|#######6  | 31.7M/41.5M [00:00<00:00, 56.5MB/s]
     90%|########9 | 37.2M/41.5M [00:00<00:00, 53.6MB/s]
    100%|##########| 41.5M/41.5M [00:00<00:00, 50.4MB/s]
+
      0%|          | 0.00/41.5M [00:00<?, ?B/s]
     19%|#9        | 7.99M/41.5M [00:00<00:00, 52.5MB/s]
     39%|###8      | 16.0M/41.5M [00:00<00:00, 53.4MB/s]
     54%|#####3    | 22.3M/41.5M [00:00<00:00, 50.4MB/s]
     65%|######5   | 27.1M/41.5M [00:00<00:00, 47.8MB/s]
     92%|#########2| 38.3M/41.5M [00:00<00:00, 52.9MB/s]
    100%|##########| 41.5M/41.5M [00:00<00:00, 53.0MB/s]
 
 
 
diff --git a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
index 506c781b0..159e33e83 100644
--- a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
@@ -94,7 +94,7 @@ Load a pretrained PyTorch model
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
-
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     10%|9         | 4.34M/44.7M [00:00<00:00, 45.5MB/s]
     19%|#9        | 8.67M/44.7M [00:00<00:00, 45.3MB/s]
     69%|######9   | 31.0M/44.7M [00:00<00:00, 131MB/s] 
    100%|##########| 44.7M/44.7M [00:00<00:00, 128MB/s]
+
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
      3%|2         | 1.29M/44.7M [00:00<00:03, 13.3MB/s]
      9%|8         | 3.87M/44.7M [00:00<00:02, 21.3MB/s]
     21%|##        | 9.27M/44.7M [00:00<00:00, 37.3MB/s]
     47%|####6     | 20.9M/44.7M [00:00<00:00, 70.5MB/s]
     96%|#########5| 42.8M/44.7M [00:00<00:00, 128MB/s] 
    100%|##########| 44.7M/44.7M [00:00<00:00, 91.8MB/s]
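
    For context, the checkpoint above is consumed via the usual TorchScript path;
    a minimal sketch following the from_pytorch tutorial (trace the torchvision
    model, then convert to Relay):

    .. code-block:: python

        # Load the pretrained model, trace it with TorchScript, convert to Relay.
        import torch
        import torchvision
        from tvm import relay

        model = torchvision.models.resnet18(pretrained=True).eval()
        input_data = torch.randn(1, 3, 224, 224)
        scripted_model = torch.jit.trace(model, input_data).eval()
        shape_list = [("input0", (1, 3, 224, 224))]
        mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)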
 
 
 
diff --git a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
index d1030ddb8..009e690a4 100644
--- a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
@@ -421,11 +421,6 @@ Run the corresponding model on tensorflow
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  3.586 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_tensorflow.py:
 
 .. only:: html
diff --git a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
index 14c0b6851..7bf8741b4 100644
--- a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
@@ -5,26 +5,26 @@
 
 Computation times
 =================
-**04:57.533** total execution time for **how_to_compile_models** files:
+**05:08.856** total execution time for **how_to_compile_models** files:
 
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 01:03.586 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 00:59.479 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 01:02.506 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 00:59.268 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:39.520 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:40.230 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:25.939 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:32.086 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:25.603 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:25.277 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:22.957 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:24.849 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:21.666 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:23.270 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:19.331 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:22.769 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:14.019 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:19.076 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.406 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.552 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
index e489fdb58..92e6fb94c 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
@@ -441,7 +441,7 @@ Execute on TVM
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      15.8706      15.8903      15.9835      15.7127       0.0826   
+      15.7614      15.6817      16.6102      15.5854       0.2887   
                
 
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
index 6419d7936..2fd4ad9ee 100644
--- a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
@@ -123,7 +123,7 @@ Load pre-trained maskrcnn from torchvision and do tracing
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
-
      0%|          | 0.00/170M [00:00<?, ?B/s]
      3%|3         | 5.31M/170M [00:00<00:03, 55.2MB/s]
      6%|6         | 10.6M/170M [00:00<00:03, 42.6MB/s]
      9%|8         | 14.8M/170M [00:00<00:04, 33.3MB/s]
     12%|#1        | 19.9M/170M [00:00<00:03, 39.4MB/s]
     14%|#4        | 24.2M/170M [00:00<00:03, 41.2MB/s]
     18%|#7        | 30.3M/170M [00:00<00:03, 47.9MB/s]
     21%|##        | 35.1M/170M [00:00<00:03, 45.9MB/s]
     23%|##3       | 39.7M/170M [00:00<00:03, 41.0MB/s]
     26%|##5       | 43.8M/170M [00:01<00:03, 39.3MB/s]
     29%|##8       | 49.0M/170M [00:01<00:02, 43.4MB/s]
     31%|###1      | 53.3M/170M [00:01<00:02, 41.8MB/s]
     34%|###3      | 57.4M/170M [00:01<00:02, 41.0MB/s]
     37%|###6      | 62.1M/170M [00:01<00:02, 43.3MB/s]
     40%|####      | 67.9M/170M [00:01<00:02, 48.2MB/s]
     44%|####3     | 74.1M/170M [00:01<00:01, 53.0MB/s]
     47%|####6     | 79.3M/170M [00:01<00:01, 48.1MB/s]
     49%|####9     | 84.0M/170M [00:02<00:02, 39.3MB/s]
     52%|#####2    | 89.1M/170M [00:02<00:01, 42.8MB/s]
     55%|#####5    | 93.9M/170M [00:02<00:01, 44.5MB/s]
     58%|#####7    | 98.4M/170M [00:02<00:01, 45.0MB/s]
     61%|######    | 103M/170M [00:02<00:01, 42.2MB/s] 
     63%|######3   | 107M/170M [00:02<00:01, 39.5MB/s]
     66%|######6   | 113M/170M [00:02<00:01, 45.0MB/s]
     69%|######9   | 117M/170M [00:02<00:01, 43.7MB/s]
     72%|#######1  | 122M/170M [00:02<00:01, 45.0MB/s]
     74%|#######4  | 126M/170M [00:03<00:00, 45.6MB/s]
     78%|#######8  | 133M/170M [00:03<00:00, 50.7MB/s]
     81%|########  | 137M/170M [00:03<00:00, 50.8MB/s]
     84%|########3 | 142M/170M [00:03<00:00, 48.9MB/s]
     87%|########7 | 148M/170M [00:03<00:00, 52.0MB/s]
     91%|######### | 154M/170M [00:03<00:00, 54.1MB/s]
     94%|#########3| 159M/170M [00:03<00:00, 49.7MB/s]
     96%|#########6| 164M/170M [00:03<00:00, 49.8MB/s]
    100%|#########9| 170M/170M [00:03<00:00, 52.6MB/s]
    100%|##########| 170M/170M [00:03<00:00, 45.6MB/s]
+
      0%|          | 0.00/170M [00:00<?, ?B/s]
     11%|#         | 18.4M/170M [00:00<00:00, 193MB/s]
     25%|##4       | 42.2M/170M [00:00<00:00, 226MB/s]
     39%|###8      | 65.8M/170M [00:00<00:00, 236MB/s]
     53%|#####3    | 90.7M/170M [00:00<00:00, 246MB/s]
     68%|######7   | 115M/170M [00:00<00:00, 249MB/s] 
     82%|########1 | 139M/170M [00:00<00:00, 249MB/s]
     96%|#########5| 163M/170M [00:00<00:00, 226MB/s]
    100%|##########| 170M/170M [00:00<00:00, 234MB/s]
     /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
       for i in range(dim)
     /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -292,7 +292,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  58.197 seconds)
+   **Total running time of the script:** ( 2 minutes  57.913 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
index 4d5a7547d..7e76d8b43 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
@@ -232,7 +232,7 @@ training. Other models require a full post-training calibration.
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
-
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
      4%|4         | 576k/13.6M [00:00<00:02, 5.87MB/s]
     12%|#1        | 1.56M/13.6M [00:00<00:01, 8.56MB/s]
     24%|##4       | 3.30M/13.6M [00:00<00:00, 12.9MB/s]
     47%|####7     | 6.38M/13.6M [00:00<00:00, 20.5MB/s]
     87%|########6 | 11.8M/13.6M [00:00<00:00, 33.4MB/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 27.0MB/s]
+
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
      8%|8         | 1.09M/13.6M [00:00<00:01, 11.4MB/s]
     25%|##5       | 3.41M/13.6M [00:00<00:00, 19.0MB/s]
     59%|#####9    | 8.02M/13.6M [00:00<00:00, 32.3MB/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 38.8MB/s]
 
 
 
@@ -412,7 +412,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      90.5050      90.4569      93.1727      90.1209       0.3688   
+      90.3646      90.2865      92.6330      90.2238       0.2759   
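+
+    A summary like the one above is gathered from repeated timed runs; a sketch
+    of the usual pattern, where ``module`` (the compiled graph-executor module)
+    and ``dev`` (the target device) are assumed from the surrounding tutorial:
+
+    .. code-block:: python
+
+        # Time 100 independent runs and report the distribution in ms.
+        import numpy as np
+
+        ftimer = module.module.time_evaluator("run", dev, number=1, repeat=100)
+        prof_res = np.array(ftimer().results) * 1e3  # seconds -> milliseconds
+        print("mean %.4f  median %.4f  std %.4f (ms)"
+              % (np.mean(prof_res), np.median(prof_res), np.std(prof_res)))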
                
 
 
@@ -461,7 +461,7 @@ TODO
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  6.790 seconds)
+   **Total running time of the script:** ( 1 minutes  10.470 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
index 07be8db86..3fb620be5 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
@@ -439,7 +439,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      119.9239     119.8118     124.9368     119.2458      0.7291   
+      119.3345     119.1992     127.5463     118.7020      0.8721   
                
 
 
@@ -476,7 +476,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  57.294 seconds)
+   **Total running time of the script:** ( 1 minutes  52.020 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
index ff5664a8d..e6ba859f0 100644
--- a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
@@ -255,7 +255,7 @@ We create a Relay VM to build and execute the model.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  31.072 seconds)
+   **Total running time of the script:** ( 1 minutes  25.102 seconds)
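+
+   The "Relay VM" mentioned in the hunk context follows the usual
+   compile-and-invoke pattern; a sketch, with ``mod``, ``params``, ``target``,
+   and ``data`` assumed from the tutorial:
+
+   .. code-block:: python
+
+       # Compile the quantized module for the VM, then run its "main" function.
+       import tvm
+       from tvm import relay
+       from tvm.runtime.vm import VirtualMachine
+
+       with tvm.transform.PassContext(opt_level=3):
+           vm_exec = relay.vm.compile(mod, target=target, params=params)
+       dev = tvm.device(str(target), 0)
+       vm = VirtualMachine(vm_exec, dev)
+       result = vm.invoke("main", data)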
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_quantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
index 2d0e03b4e..8c2a32c20 100644
--- a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
@@ -158,7 +158,7 @@ Convert and compile model for CPU.
             data: None
       input_sym_arg_type = in_param.infer_type()[0]
     Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
-
      0%|          | 0/132723 [00:00<?, ?KB/s]
      4%|3         | 5091/132723 [00:00<00:02, 50903.37KB/s]
      9%|9         | 12486/132723 [00:00<00:01, 64456.81KB/s]
     15%|#5        | 20121/132723 [00:00<00:01, 69882.30KB/s]
     21%|##        | 27549/132723 [00:00<00:01, 71616.07KB/s]
     27%|##6       | 35259/132723 [00:00<00:01, 73591.42KB/s]
     32%|###2      | 42899/132723 [00:00<00:01, 74543.16KB/s]
     38%|###7      | 50400/132723 [00:00<00:01, 74687.76KB/s]
     44%|####3     | 58133/132723 [00:00<00:00, 75525.12KB/s]
     50%|####9     | 66157/132723 [00:00<00:00, 76994.90KB/s]
     56%|#####5    | 74160/132723 [00:01<00:00, 77927.34KB/s]
     62%|######1   | 82120/132723 [00:01<00:00, 78435.50KB/s]
     68%|######7   | 90128/132723 [00:01<00:00, 78931.33KB/s]
     74%|#######3  | 98161/132723 [00:01<00:00, 79352.72KB/s]
     80%|#######9  | 106105/132723 [00:01<00:00, 79361.67KB/s]
     86%|########5 | 114042/132723 [00:01<00:00, 78587.86KB/s]
     92%|#########1| 122101/132723 [00:01<00:00, 79184.77KB/s]
     98%|#########8| 130224/132723 [00:01<00:00, 79794.09KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 76502.38KB/s]
+
      0%|          | 0/132723 [00:00<?, ?KB/s]
      4%|4         | 5381/132723 [00:00<00:02, 53800.82KB/s]
     10%|#         | 13400/132723 [00:00<00:01, 69320.29KB/s]
     16%|#5        | 21149/132723 [00:00<00:01, 73047.23KB/s]
     22%|##1       | 28918/132723 [00:00<00:01, 74877.66KB/s]
     28%|##7       | 36781/132723 [00:00<00:01, 76229.48KB/s]
     34%|###3      | 44772/132723 [00:00<00:01, 77478.12KB/s]
     40%|###9      | 52682/132723 [00:00<00:01, 78006.28KB/s]
     46%|####5     | 60598/132723 [00:00<00:00, 78370.48KB/s]
     52%|#####1    | 68447/132723 [00:00<00:00, 78405.94KB/s]
     57%|#####7    | 76288/132723 [00:01<00:00, 76527.01KB/s]
     63%|######3   | 84247/132723 [00:01<00:00, 77447.10KB/s]
     69%|######9   | 92033/132723 [00:01<00:00, 77568.80KB/s]
     75%|#######5  | 99837/132723 [00:01<00:00, 77707.27KB/s]
     81%|########1 | 107669/132723 [00:01<00:00, 77884.99KB/s]
     87%|########7 | 115565/132723 [00:01<00:00, 78204.36KB/s]
     93%|#########2| 123426/132723 [00:01<00:00, 78324.28KB/s]
     99%|#########8| 131300/132723 [00:01<00:00, 78447.82KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 76834.91KB/s]
 
 
 
@@ -241,7 +241,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  20.224 seconds)
+   **Total running time of the script:** ( 2 minutes  18.511 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
index d32eb3201..8be97f5f2 100644
--- a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
@@ -5,22 +5,22 @@
 
 Computation times
 =================
-**10:44.323** total execution time for **how_to_deploy_models** files:
+**10:35.640** total execution time for **how_to_deploy_models** files:
 
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 02:58.197 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 02:57.913 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 02:20.224 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 02:18.511 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 01:57.294 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 01:52.020 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:31.072 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:25.102 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:06.790 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:10.470 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:28.246 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:30.126 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:22.493 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:21.492 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)                                     | 00:00.006 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
index d4f236f99..c95f43a4e 100644
--- a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
@@ -476,7 +476,7 @@ First let us define two helper functions to get the mobilenet model and a cat im
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip009d8886-d77e-484a-8d9e-3d2749623c9e from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip99d12be9-192c-4776-b24a-3784472088b8 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 
 
 
diff --git a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
index 50b5cc5c2..07417928d 100644
--- a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**00:39.888** total execution time for **how_to_extend_tvm** files:
+**00:39.452** total execution time for **how_to_extend_tvm** files:
 
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:36.328 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:36.319 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.637 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.203 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:00.916 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:00.921 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)       | 00:00.007 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)       | 00:00.008 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
index fa58ae2c1..5520ccf26 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
@@ -216,10 +216,10 @@ profile the execution time of each pass.
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 9915us [9915us] (44.98%; 44.98%)
-    FoldScaleAxis: 12127us [6us] (55.02%; 55.02%)
-            FoldConstant: 12120us [2351us] (54.99%; 99.95%)
-                    InferType: 9770us [9770us] (44.32%; 80.61%)
+    InferType: 6763us [6763us] (45.62%; 45.62%)
+    FoldScaleAxis: 8063us [6us] (54.38%; 54.38%)
+            FoldConstant: 8057us [1625us] (54.34%; 99.92%)
+                    InferType: 6432us [6432us] (43.38%; 79.83%)
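+
+    The profile above is produced by TVM's pass-timing instrument; a minimal
+    sketch following the use_pass_instrument tutorial, with ``relay_mod``
+    assumed to be an existing Relay module:
+
+    .. code-block:: python
+
+        # Attach the timing instrument to a PassContext, run some passes,
+        # and render the collected profile (render() must be called inside
+        # the context).
+        import tvm
+        from tvm import relay
+        from tvm.ir.instrument import PassTimingInstrument
+
+        timing_inst = PassTimingInstrument()
+        with tvm.transform.PassContext(opt_level=3, instruments=[timing_inst]):
+            relay_mod = relay.transform.InferType()(relay_mod)
+            relay_mod = relay.transform.FoldScaleAxis()(relay_mod)
+            profile = timing_inst.render()
+        print("Printing results of timing profile...")
+        print(profile)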
 
 
 
@@ -258,10 +258,10 @@ Refer to the following sections and :py:func:`tvm.instrument.pass_instrument` for th
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 9507us [9507us] (44.51%; 44.51%)
-    FoldScaleAxis: 11853us [5us] (55.49%; 55.49%)
-            FoldConstant: 11848us [2500us] (55.47%; 99.95%)
-                    InferType: 9348us [9348us] (43.76%; 78.90%)
+    InferType: 6405us [6405us] (44.86%; 44.86%)
+    FoldScaleAxis: 7871us [5us] (55.14%; 55.14%)
+            FoldConstant: 7867us [1612us] (55.10%; 99.94%)
+                    InferType: 6254us [6254us] (43.81%; 79.50%)
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
index 0edf62f3d..0606d1b03 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
@@ -340,7 +340,7 @@ latency of convolution.
 
  .. code-block:: none
 
-    Convolution: 39.655540 ms
+    Convolution: 40.681556 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
index 39ad34c80..10b756fd4 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
@@ -671,7 +671,7 @@ be able to run on our build server
 
  .. code-block:: none
 
-    conv2d with tensor core: 8.041140 ms
+    conv2d with tensor core: 13.369779 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
index c5abd09ac..7c48ba646 100644
--- a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
@@ -143,8 +143,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 
  .. code-block:: none
 
-    Numpy running time: 0.018982
-    Baseline: 3.371941
+    Numpy running time: 0.018258
+    Baseline: 3.408796
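+
+    For context, the first figure times ``numpy.dot`` as the reference and the
+    second times the naive TE schedule the same way; a sketch of the numpy half,
+    using the tutorial's sizes (M = K = N = 1024):
+
+    .. code-block:: python
+
+        # Average 10 numpy matmuls of 1024x1024 float32 matrices.
+        import timeit
+        import numpy as np
+
+        M = K = N = 1024
+        a = np.random.rand(M, K).astype("float32")
+        b = np.random.rand(K, N).astype("float32")
+        np_time = timeit.timeit(lambda: a.dot(b), number=10) / 10
+        print("Numpy running time: %f" % np_time)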
 
 
 
@@ -239,7 +239,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 
  .. code-block:: none
 
-    Opt1: 0.303872
+    Opt1: 0.284209
 
 
 
@@ -342,7 +342,7 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cach
 
  .. code-block:: none
 
-    Opt2: 0.344798
+    Opt2: 0.330278
 
 
 
@@ -438,7 +438,7 @@ the access pattern for A matrix is more cache friendly.
 
  .. code-block:: none
 
-    Opt3: 0.119397
+    Opt3: 0.116985
 
 
 
@@ -563,7 +563,7 @@ flattening.
 
  .. code-block:: none
 
-    Opt4: 0.110414
+    Opt4: 0.110756
 
 
 
@@ -685,7 +685,7 @@ write to C when all the block results are ready.
 
  .. code-block:: none
 
-    Opt5: 0.111262
+    Opt5: 0.111863
 
 
 
@@ -810,7 +810,7 @@ Furthermore, we can also utilize multi-core processors to do the thread-level par
 
  .. code-block:: none
 
-    Opt6: 0.144883
+    Opt6: 0.145240
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
index 39529c6e2..512c13cc3 100644
--- a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:34.575** total execution time for **how_to_optimize_operators** files:
+**00:34.270** total execution time for **how_to_optimize_operators** files:
 
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:32.262 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:31.892 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.291 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.362 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.022 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.015 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
index f251ef665..10a55cc1b 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
@@ -5,18 +5,18 @@
 
 Computation times
 =================
-**05:24.599** total execution time for **how_to_tune_with_autoscheduler** files:
+**05:37.999** total execution time for **how_to_tune_with_autoscheduler** files:
 
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 02:46.458 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 02:57.713 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:20.837 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:20.722 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 00:43.350 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 00:43.441 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:17.110 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:18.298 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:08.596 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:09.001 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:08.249 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:08.824 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
index f045f3a60..051093cc7 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
@@ -206,13 +206,6 @@ file and apply it.
 
 
 
-.. rst-class:: sphx-glr-script-out
-
- .. code-block:: none
-
-    .T
-
-
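-
-    The "file and apply it" context refers to replaying the best tuning record
-    from the auto-scheduler log; a sketch, with ``task``, ``log_file``, and
-    ``target`` assumed from earlier steps of the tune_conv2d_layer_cuda tutorial:
-
-    .. code-block:: python
-
-        # Load the best record from the log and build the tuned kernel.
-        import tvm
-
-        sch, args = task.apply_best(log_file)
-        func = tvm.build(sch, args, target)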
 
 
 
@@ -247,53 +240,270 @@ cooperative fetching, unrolling and operator fusion.
                  compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
       buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
       preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
-      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 112;
-      allocate(conv2d_nchw: Pointer(local float32), float32, [7]), storage_scope = local;
-      allocate(pad_temp.shared: Pointer(shared float32), float32, [54]), storage_scope = shared;
-      allocate(kernel.shared: Pointer(shared float32), float32, [576]), storage_scope = shared;
-      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32 {
-        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [1], [], scope="local", align=4)[0] = 0f32
+      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 32;
+      allocate(conv2d_nchw: Pointer(local float32), float32, [28]), storage_scope = local;
+      allocate(pad_temp.shared: Pointer(shared float32), float32, [1008]), storage_scope = shared;
+      allocate(kernel.shared: Pointer(shared float32), float32, [768]), storage_scope = shared;
+      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28 {
+        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [28], [], scope="local", align=64)[0] = 0f32
+        conv2d_nchw_1[7] = 0f32
         conv2d_nchw_1[1] = 0f32
+        conv2d_nchw_1[8] = 0f32
         conv2d_nchw_1[2] = 0f32
+        conv2d_nchw_1[9] = 0f32
         conv2d_nchw_1[3] = 0f32
+        conv2d_nchw_1[10] = 0f32
         conv2d_nchw_1[4] = 0f32
+        conv2d_nchw_1[11] = 0f32
         conv2d_nchw_1[5] = 0f32
+        conv2d_nchw_1[12] = 0f32
         conv2d_nchw_1[6] = 0f32
-        for (rc.outer.outer: int32, 0, 256) {
-          for (ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer: int32, 0, 2) {
-            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
-            if @tir.likely((((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*16) + floordiv(threadIdx.x_1, 2)) < 27), dtype=bool) {
-              pad_temp.shared_1: Buffer(pad_temp.shared, float32, [54], [], scope="shared")[((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*32) + threadIdx.x_1)] = @tir.if_then_else(((((1 <= (floordiv(floormod(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*5) + threadIdx.x_1), 27), 9) + floormod(blockIdx.x, 7))) && ((floordiv(floormod(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*5) + threadIdx.x_1), 27), 9) + floormod(blockIdx.x, 7)) < 8)) && (1 <= floormod(((ax0.ax1.fused.ax2.fused. [...]
-            }
-          }
-          for (ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1: int32, 0, 9) {
-            attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
-            kernel.shared_1: Buffer(kernel.shared, float32, [576], [], scope="shared")[ramp(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1*64) + (threadIdx.x_2*2)), 1, 2)] = kernel[ramp(((((floordiv(blockIdx.x, 7)*147456) + (floordiv(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1*64) + (threadIdx.x_2*2)), 18)*4608)) + (rc.outer.outer*18)) + floormod(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1*10) + (threadIdx.x_2*2)), 18)), 1, 2)]
-          }
-          for (rc.inner: int32, 0, 2) {
-            for (ry.inner: int32, 0, 3) {
-              for (rx.inner: int32, 0, 3) {
-                let cse_var_1: int32 = (((rc.inner*27) + (ry.inner*9)) + rx.inner)
-                 {
-                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[cse_var_1]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(cse_var_1 + 1)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(cse_var_1 + 2)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(cse_var_1 + 3)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(cse_var_1 + 4)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(cse_var_1 + 5)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(cse_var_1 + 6)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-                }
+        conv2d_nchw_1[13] = 0f32
+        conv2d_nchw_1[14] = 0f32
+        conv2d_nchw_1[21] = 0f32
+        conv2d_nchw_1[15] = 0f32
+        conv2d_nchw_1[22] = 0f32
+        conv2d_nchw_1[16] = 0f32
+        conv2d_nchw_1[23] = 0f32
+        conv2d_nchw_1[17] = 0f32
+        conv2d_nchw_1[24] = 0f32
+        conv2d_nchw_1[18] = 0f32
+        conv2d_nchw_1[25] = 0f32
+        conv2d_nchw_1[19] = 0f32
+        conv2d_nchw_1[26] = 0f32
+        conv2d_nchw_1[20] = 0f32
+        conv2d_nchw_1[27] = 0f32
+        for (rc.outer.outer: int32, 0, 32) {
+          for (ry.outer.outer: int32, 0, 3) {
+            let cse_var_4: int32 = (rc.outer.outer*784)
+            let cse_var_3: int32 = (ry.outer.outer*7)
+            let cse_var_2: int32 = (rc.outer.outer*144)
+            let cse_var_1: int32 = (ry.outer.outer*3)
+             {
+              attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1008], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((1 <= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 28)] = @tir.if_then_else(((((floordiv((threadIdx.x_1 + 28), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 28), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 56)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 56), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 84)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 84), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 112)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 112), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 140)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 140), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 168)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 168), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 196), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 224)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 224), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 252)] = @tir.if_then_else((((1 <= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 188)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 280)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 280), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 308)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 308), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 336)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 336), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 364)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 364), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 392), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 420)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 420), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 448)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 448), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 476)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 476), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 504)] = @tir.if_then_else((((1 <= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 384)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 532)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 532), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 560)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 560), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 588), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 616)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 616), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 644)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 644), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 672)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 672), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 700)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 700), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 728)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 728), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 756)] = @tir.if_then_else((((1 <= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 580)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 784), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 812)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 812), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 840)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 840), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 868)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 868), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 896)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 896), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 924)] = @tir.if_then_else(((((1 <= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 924), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 952)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 952), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) < 8) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 980), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1: Buffer(kernel.shared, float32, [768], [], scope="shared")[threadIdx.x_2] = kernel[(((((blockIdx.x*73728) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 28)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 28), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 56)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 56), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 84)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 84), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 12), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 112)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 112), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 140)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 140), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 44), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 168)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 168), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 196)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 196), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 4), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 224)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 224), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 32), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 252)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 252), 48)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 4)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 280)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 280), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 40), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 308)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 308), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 20), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 336)] = kernel[((((((blockIdx.x*73728) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 32256)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 364)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 364), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 392), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 420)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 420), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 12), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 448), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 476)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 476), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 44), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 504)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 504), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 532)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 532), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 4), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 560)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 560), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 32), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 588)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 588), 48)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 4)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 616)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 616), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 40), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 644)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 644), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 20), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 672)] = kernel[((((((blockIdx.x*73728) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 64512)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 700)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 700), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              kernel.shared_1[(threadIdx.x_2 + 728)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 728), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 28;
+              if @tir.likely((threadIdx.x_2 < 12), dtype=bool) {
+                kernel.shared_1[(threadIdx.x_2 + 756)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 756), 48)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 12)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+              }
+              for (rc.outer.inner: int32, 0, 16) {
+                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+                conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+                conv2d_nchw_1[14] = (conv2d_nchw_1[14] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[21] = (conv2d_nchw_1[21] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[15] = (conv2d_nchw_1[15] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[22] = (conv2d_nchw_1[22] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[16] = (conv2d_nchw_1[16] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[23] = (conv2d_nchw_1[23] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[17] = (conv2d_nchw_1[17] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[24] = (conv2d_nchw_1[24] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[18] = (conv2d_nchw_1[18] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[25] = (conv2d_nchw_1[25] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[19] = (conv2d_nchw_1[19] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[26] = (conv2d_nchw_1[26] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[20] = (conv2d_nchw_1[20] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+                conv2d_nchw_1[27] = (conv2d_nchw_1[27] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+                conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+                conv2d_nchw_1[14] = (conv2d_nchw_1[14] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[21] = (conv2d_nchw_1[21] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[15] = (conv2d_nchw_1[15] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[22] = (conv2d_nchw_1[22] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[16] = (conv2d_nchw_1[16] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[23] = (conv2d_nchw_1[23] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[17] = (conv2d_nchw_1[17] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[24] = (conv2d_nchw_1[24] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[18] = (conv2d_nchw_1[18] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[25] = (conv2d_nchw_1[25] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[19] = (conv2d_nchw_1[19] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[26] = (conv2d_nchw_1[26] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[20] = (conv2d_nchw_1[20] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+                conv2d_nchw_1[27] = (conv2d_nchw_1[27] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+                conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+                conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+                conv2d_nchw_1[14] = (conv2d_nchw_1[14] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[21] = (conv2d_nchw_1[21] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+                conv2d_nchw_1[15] = (conv2d_nchw_1[15] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[22] = (conv2d_nchw_1[22] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+                conv2d_nchw_1[16] = (conv2d_nchw_1[16] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[23] = (conv2d_nchw_1[23] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+                conv2d_nchw_1[17] = (conv2d_nchw_1[17] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[24] = (conv2d_nchw_1[24] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+                conv2d_nchw_1[18] = (conv2d_nchw_1[18] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[25] = (conv2d_nchw_1[25] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+                conv2d_nchw_1[19] = (conv2d_nchw_1[19] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[26] = (conv2d_nchw_1[26] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+                conv2d_nchw_1[20] = (conv2d_nchw_1[20] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+                conv2d_nchw_1[27] = (conv2d_nchw_1[27] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
               }
             }
           }
         }
-        compute[(((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7))] = max((conv2d_nchw_1[0] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-        compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 1)] = max((conv2d_nchw_1[1] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-        compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 2)] = max((conv2d_nchw_1[2] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-        compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 3)] = max((conv2d_nchw_1[3] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-        compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 4)] = max((conv2d_nchw_1[4] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-        compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 5)] = max((conv2d_nchw_1[5] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-        compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 6)] = max((conv2d_nchw_1[6] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
+        for (i1.inner: int32, 0, 4) {
+          for (i2.inner: int32, 0, 7) {
+            compute[(((((blockIdx.x*784) + (floordiv(threadIdx.x, 7)*196)) + (i1.inner*49)) + (i2.inner*7)) + floormod(threadIdx.x, 7))] = max((conv2d_nchw_1[((i1.inner*7) + i2.inner)] + bias[(((blockIdx.x*16) + (floordiv(threadIdx.x, 7)*4)) + i1.inner)]), 0f32)
+          }
+        }
       }
     }
 
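A note on the write-back above: the ``max(conv2d_nchw_1[...] + bias[...], 0f32)`` stores are the fused bias-add and ReLU epilogue of this tutorial's workload. A minimal TE sketch of that definition (illustrative names, using the tutorial's 1x512x7x7 / 3x3 / stride 1 / padding 1 shapes):

.. code-block:: python

    from tvm import te, topi

    # Hedged sketch of the workload whose lowered epilogue appears above.
    N, CI, CO, H, W = 1, 512, 512, 7, 7
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, 3, 3), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride=1, padding=1, dilation=1)
    # Fused bias-add + ReLU; this is what lowers to max(conv + bias, 0f32).
    out = topi.nn.relu(conv + bias)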
@@ -347,7 +557,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.311 ms
+    Execution time of this operator: 0.355 ms
 
 
 
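For reference, the timing above is produced by the tutorial's build-and-check step; a minimal sketch, assuming the ``sch``, ``args``, ``target``, and numpy reference arrays (``data_np`` etc.) defined earlier in the tutorial:

.. code-block:: python

    import numpy as np
    import tvm

    # Build the tuned schedule and upload the inputs to the GPU.
    func = tvm.build(sch, args, target)
    dev = tvm.cuda()
    data_tvm = tvm.nd.array(data_np, device=dev)
    weight_tvm = tvm.nd.array(weight_np, device=dev)
    bias_tvm = tvm.nd.array(bias_np, device=dev)
    out_tvm = tvm.nd.empty(out_np.shape, device=dev)
    func(data_tvm, weight_tvm, bias_tvm, out_tvm)

    # Check correctness against the numpy reference, then time the kernel;
    # min_repeat_ms reduces timing noise on short kernels.
    np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    print(
        "Execution time of this operator: %.3f ms"
        % (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
    )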
@@ -395,37 +605,37 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_i, factor=1)
     conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
     conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
-    conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
-    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
-    conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=32)
+    conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=2)
+    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
+    conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=4)
     conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
     conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
-    conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
+    conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=7)
     conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
     conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
     conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
     conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
-    conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=1)
-    conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=7)
-    conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
-    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=1)
-    conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
+    conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
+    conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
+    conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=1)
+    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=16)
+    conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
     conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
-    conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3)
-    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
+    conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
+    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
     s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2 [...]
     compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
     compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
     compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
-    compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=32)
+    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=4)
+    compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=4)
     compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
-    compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
+    compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=7)
     compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
     compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
     compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
-    compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
-    compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=7)
+    compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
+    compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
     s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
     s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
     kernel_shared = s.cache_read(kernel, "shared", [conv2d_nchw])
@@ -442,16 +652,16 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
     s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis("threadIdx.x"))
     kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
-    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=2)
+    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
     s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+    kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=28)
     s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
     s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=28)
     s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 0)
+    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 1024)
     s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
 
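Both dumps in this section (the python schedule above and the CUDA source that follows) come from ``task.print_best``; a minimal sketch, assuming the ``task`` and ``log_file`` from earlier in the tutorial:

.. code-block:: python

    # Replay the best record in the tuning log as a schedule or as CUDA source.
    print("Equivalent python schedule:")
    print(task.print_best(log_file, print_mode="schedule"))

    print("CUDA source code:")
    print(task.print_best(log_file, print_mode="cuda"))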
     CUDA source code:
@@ -469,49 +679,201 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
       #define int64_t long long
       #define uint64_t unsigned long long
     #endif
-    extern "C" __global__ void __launch_bounds__(32) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-      float conv2d_nchw[7];
-      __shared__ float pad_temp_shared[54];
-      __shared__ float kernel_shared[576];
+    extern "C" __global__ void __launch_bounds__(28) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+      float conv2d_nchw[28];
+      __shared__ float pad_temp_shared[1008];
+      __shared__ float kernel_shared[768];
       conv2d_nchw[0] = 0.000000e+00f;
+      conv2d_nchw[7] = 0.000000e+00f;
       conv2d_nchw[1] = 0.000000e+00f;
+      conv2d_nchw[8] = 0.000000e+00f;
       conv2d_nchw[2] = 0.000000e+00f;
+      conv2d_nchw[9] = 0.000000e+00f;
       conv2d_nchw[3] = 0.000000e+00f;
+      conv2d_nchw[10] = 0.000000e+00f;
       conv2d_nchw[4] = 0.000000e+00f;
+      conv2d_nchw[11] = 0.000000e+00f;
       conv2d_nchw[5] = 0.000000e+00f;
+      conv2d_nchw[12] = 0.000000e+00f;
       conv2d_nchw[6] = 0.000000e+00f;
-      for (int rc_outer_outer = 0; rc_outer_outer < 256; ++rc_outer_outer) {
-        __syncthreads();
-        for (int ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer = 0; ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer < 2; ++ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer) {
-          if (((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 16) + (((int)threadIdx.x) >> 1)) < 27) {
-            pad_temp_shared[((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 32) + ((int)threadIdx.x))] = (((((1 <= (((((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 5) + ((int)threadIdx.x)) % 27) / 9) + (((int)blockIdx.x) % 7))) && ((((((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 5) + ((int)threadIdx.x)) % 27) / 9) + (((int)blockIdx.x) % 7)) < 8)) && (1 <= (((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 5) + ((int)threadIdx.x)) % 9))) && ((((ax0_ax1_fused_ax2_fused_ax3_fused [...]
+      conv2d_nchw[13] = 0.000000e+00f;
+      conv2d_nchw[14] = 0.000000e+00f;
+      conv2d_nchw[21] = 0.000000e+00f;
+      conv2d_nchw[15] = 0.000000e+00f;
+      conv2d_nchw[22] = 0.000000e+00f;
+      conv2d_nchw[16] = 0.000000e+00f;
+      conv2d_nchw[23] = 0.000000e+00f;
+      conv2d_nchw[17] = 0.000000e+00f;
+      conv2d_nchw[24] = 0.000000e+00f;
+      conv2d_nchw[18] = 0.000000e+00f;
+      conv2d_nchw[25] = 0.000000e+00f;
+      conv2d_nchw[19] = 0.000000e+00f;
+      conv2d_nchw[26] = 0.000000e+00f;
+      conv2d_nchw[20] = 0.000000e+00f;
+      conv2d_nchw[27] = 0.000000e+00f;
+      for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
+        for (int ry_outer_outer = 0; ry_outer_outer < 3; ++ry_outer_outer) {
+          __syncthreads();
+          pad_temp_shared[((int)threadIdx.x)] = ((((1 <= ((((int)threadIdx.x) / 9) + ry_outer_outer)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 28)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 28) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 56)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 56) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 84)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 84) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 112)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 112) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 140)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 140) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 168)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 168) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 196)] = ((((1 <= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 196) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 224)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 224) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 252)] = ((((1 <= ((((int)threadIdx.x) / 9) + ry_outer_outer)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 188)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 280)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 280) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 308)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 308) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 336)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 336) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 364)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 364) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 392)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 420)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 420) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 448)] = ((((1 <= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 448) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 476)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 476) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 504)] = ((((1 <= ((((int)threadIdx.x) / 9) + ry_outer_outer)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 384)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 532)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 532) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 560)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 560) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 588)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 588) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 616)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 616) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 644)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 644) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 672)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 672) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 700)] = ((((1 <= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 700) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 728)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 728) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 756)] = ((((1 <= ((((int)threadIdx.x) / 9) + ry_outer_outer)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 580)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 812)] = (((((1 <= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 812) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 840)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 840) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 868)] = (((((1 <= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 868) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 896)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 896) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 924)] = (((((1 <= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) && (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) < 8)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 924) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 952)] = ((((1 <= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 952) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 980)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) < 8) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 980) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+          kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 73728) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 28)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 28) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 28) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 56)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 56) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 84)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 84) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 12) & 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 112)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 112) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 140)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 140) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 44) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 168)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 168) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 8) & 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 196)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 196) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 4) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 224)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 224) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 32) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 252)] = kernel[(((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 252) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36)];
+          kernel_shared[(((int)threadIdx.x) + 280)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 280) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 40) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 308)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 308) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 20) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 336)] = kernel[((((((((int)blockIdx.x) * 73728) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 32256)];
+          kernel_shared[(((int)threadIdx.x) + 364)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 364) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 28) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 420)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 420) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 12) & 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 448)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 448) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 476)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 476) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 44) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 504)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 504) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 8) & 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 532)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 532) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 4) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 560)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 560) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 32) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 588)] = kernel[(((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 588) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36)];
+          kernel_shared[(((int)threadIdx.x) + 616)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 616) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 40) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 644)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 644) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 20) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 672)] = kernel[((((((((int)blockIdx.x) * 73728) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 64512)];
+          kernel_shared[(((int)threadIdx.x) + 700)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 700) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 28) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+          kernel_shared[(((int)threadIdx.x) + 728)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 728) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+          if (((int)threadIdx.x) < 12) {
+            kernel_shared[(((int)threadIdx.x) + 756)] = kernel[(((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 756) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 108)];
           }
-        }
-        for (int ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 = 0; ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 < 9; ++ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1) {
-          *(float2*)(kernel_shared + ((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 * 64) + (((int)threadIdx.x) * 2))) = *(float2*)(kernel + (((((((int)blockIdx.x) / 7) * 147456) + ((((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 * 64) + (((int)threadIdx.x) * 2)) / 18) * 4608)) + (rc_outer_outer * 18)) + (((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 * 10) + (((int)threadIdx.x) * 2)) % 18)));
-        }
-        __syncthreads();
-        for (int rc_inner = 0; rc_inner < 2; ++rc_inner) {
-          for (int ry_inner = 0; ry_inner < 3; ++ry_inner) {
-            for (int rx_inner = 0; rx_inner < 3; ++rx_inner) {
-              conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_inner * 27) + (ry_inner * 9)) + rx_inner)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-              conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 1)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-              conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 2)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-              conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 3)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-              conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 4)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-              conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 5)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-              conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 6)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-            }
+          __syncthreads();
+          for (int rc_outer_inner = 0; rc_outer_inner < 16; ++rc_outer_inner) {
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+            conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+            conv2d_nchw[14] = (conv2d_nchw[14] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[21] = (conv2d_nchw[21] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[15] = (conv2d_nchw[15] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[22] = (conv2d_nchw[22] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[16] = (conv2d_nchw[16] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[23] = (conv2d_nchw[23] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[17] = (conv2d_nchw[17] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[24] = (conv2d_nchw[24] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[18] = (conv2d_nchw[18] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[25] = (conv2d_nchw[25] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[19] = (conv2d_nchw[19] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[26] = (conv2d_nchw[26] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[20] = (conv2d_nchw[20] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+            conv2d_nchw[27] = (conv2d_nchw[27] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+            conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+            conv2d_nchw[14] = (conv2d_nchw[14] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[21] = (conv2d_nchw[21] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[15] = (conv2d_nchw[15] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[22] = (conv2d_nchw[22] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[16] = (conv2d_nchw[16] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[23] = (conv2d_nchw[23] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[17] = (conv2d_nchw[17] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[24] = (conv2d_nchw[24] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[18] = (conv2d_nchw[18] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[25] = (conv2d_nchw[25] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[19] = (conv2d_nchw[19] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[26] = (conv2d_nchw[26] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[20] = (conv2d_nchw[20] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+            conv2d_nchw[27] = (conv2d_nchw[27] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+            conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+            conv2d_nchw[14] = (conv2d_nchw[14] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[21] = (conv2d_nchw[21] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+            conv2d_nchw[15] = (conv2d_nchw[15] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[22] = (conv2d_nchw[22] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+            conv2d_nchw[16] = (conv2d_nchw[16] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[23] = (conv2d_nchw[23] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+            conv2d_nchw[17] = (conv2d_nchw[17] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[24] = (conv2d_nchw[24] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+            conv2d_nchw[18] = (conv2d_nchw[18] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[25] = (conv2d_nchw[25] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+            conv2d_nchw[19] = (conv2d_nchw[19] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[26] = (conv2d_nchw[26] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+            conv2d_nchw[20] = (conv2d_nchw[20] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+            conv2d_nchw[27] = (conv2d_nchw[27] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
           }
         }
       }
-      compute[((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7))] = max((conv2d_nchw[0] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-      compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 1)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-      compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 2)] = max((conv2d_nchw[2] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-      compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 3)] = max((conv2d_nchw[3] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-      compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 4)] = max((conv2d_nchw[4] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-      compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 5)] = max((conv2d_nchw[5] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-      compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 6)] = max((conv2d_nchw[6] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
+      for (int i1_inner = 0; i1_inner < 4; ++i1_inner) {
+        for (int i2_inner = 0; i2_inner < 7; ++i2_inner) {
+          compute[(((((((int)blockIdx.x) * 784) + ((((int)threadIdx.x) / 7) * 196)) + (i1_inner * 49)) + (i2_inner * 7)) + (((int)threadIdx.x) % 7))] = max((conv2d_nchw[((i1_inner * 7) + i2_inner)] + bias[(((((int)blockIdx.x) * 16) + ((((int)threadIdx.x) / 7) * 4)) + i1_inner)]), 0.000000e+00f);
+        }
+      }
     }
 
 
@@ -572,7 +934,7 @@ In the example below we resume the search from the saved status and do 5 more trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  46.458 seconds)
+   **Total running time of the script:** ( 2 minutes  57.713 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:
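
For context, resuming a search from its log is a one-step pattern in the
auto-scheduler API. The sketch below is illustrative only: it uses a
stand-in matmul workload and a hypothetical ``resume_demo.json`` log
(assumed to hold records from an earlier run), not the tutorial's own
conv2d task.

.. code-block:: python

    import tvm
    from tvm import te, auto_scheduler

    # Stand-in workload; any registered workload resumes the same way.
    @auto_scheduler.register_workload
    def matmul(N, L, M, dtype):
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
        return [A, B, C]

    task = auto_scheduler.SearchTask(func=matmul, args=(128, 128, 128, "float32"), target="llvm")
    log_file = "resume_demo.json"  # assumed to exist from a prior tuning run

    # Reload the measured records so the cost model and search policy
    # continue from the explored states instead of starting from scratch.
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)
    search_policy = auto_scheduler.SketchPolicy(
        task,
        cost_model,
        init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
    )

    # Run a handful of extra trials and append them to the same log.
    task.tune(
        auto_scheduler.TuningOptions(
            num_measure_trials=5,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        ),
        search_policy=search_policy,
    )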
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
index 436b6befa..c723fe35b 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
@@ -647,7 +647,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-       9.8330       9.8493       9.8599       9.7896       0.0310   
+       9.7912       9.7954       9.8093       9.7689       0.0167   
                
 
 
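The "load the best schedules" step above compiles the network under the
tuning log's context. A minimal sketch, assuming a Relay module ``mod``
with ``params`` and a ``network.json`` log from a prior run:

.. code-block:: python

    import tvm
    from tvm import relay, auto_scheduler

    # Compile with the best records applied; the PassContext flag tells
    # Relay to use auto-scheduler schedules instead of TOPI defaults.
    with auto_scheduler.ApplyHistoryBest("network.json"):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target="cuda", params=params)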
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
index c08bc2adf..19466b558 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
@@ -666,7 +666,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      761.5104     760.6213     764.0922     759.8176      1.8549   
+      759.4694     759.6805     760.8415     757.8861      1.2157   
                
 
 
@@ -694,7 +694,7 @@ Other Tips
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  20.837 seconds)
+   **Total running time of the script:** ( 1 minutes  20.722 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
index 8326752df..faaf7edcc 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
@@ -397,104 +397,32 @@ layout transformation, parallelization, vectorization, unrolling, and operator fusion
                  placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
                  compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
       buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
-      preflattened_buffer_map = {compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_7: placeholder_15: Buffer(placeholder_12, int32, [4916], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), placeholder_6: placeholder_17: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_8: placeholder_18: Buffer(placeholder_13, int32, [33], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
-      for (i0.outer.i1.outer.fused: int32, 0, 64) "parallel" {
-        allocate(compute_4: Pointer(global float32), float32, [1024]), storage_scope = global {
-          for (i.inner.init: int32, 0, 64) {
-            let cse_var_1: int32 = (i.inner.init*16)
-             {
-              compute_5: Buffer(compute_4, float32, [1024], [])[cse_var_1] = 0f32
-              compute_5[(cse_var_1 + 1)] = 0f32
-              compute_5[(cse_var_1 + 2)] = 0f32
-              compute_5[(cse_var_1 + 3)] = 0f32
-              compute_5[(cse_var_1 + 4)] = 0f32
-              compute_5[(cse_var_1 + 5)] = 0f32
-              compute_5[(cse_var_1 + 6)] = 0f32
-              compute_5[(cse_var_1 + 7)] = 0f32
-              compute_5[(cse_var_1 + 8)] = 0f32
-              compute_5[(cse_var_1 + 9)] = 0f32
-              compute_5[(cse_var_1 + 10)] = 0f32
-              compute_5[(cse_var_1 + 11)] = 0f32
-              compute_5[(cse_var_1 + 12)] = 0f32
-              compute_5[(cse_var_1 + 13)] = 0f32
-              compute_5[(cse_var_1 + 14)] = 0f32
-              compute_5[(cse_var_1 + 15)] = 0f32
-            }
-          }
-          for (elem_idx: int32, 0, let cse_var_2: int32 = floormod(i0.outer.i1.outer.fused, 32) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
-            for (i.inner: int32, 0, 64) {
-              let cse_var_3: int32 = floormod(i0.outer.i1.outer.fused, 32)
-               {
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_4: int32 = (i.inner*16)
-                  compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[((placeholder_3[cse_var_3]*16) + (elem_idx*16))]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_5: int32 = ((i.inner*16) + 1)
-                  compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 1)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_6: int32 = ((i.inner*16) + 2)
-                  compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 2)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_7: int32 = ((i.inner*16) + 3)
-                  compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 3)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_8: int32 = ((i.inner*16) + 4)
-                  compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 4)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_9: int32 = ((i.inner*16) + 5)
-                  compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 5)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_10: int32 = ((i.inner*16) + 6)
-                  compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 6)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_11: int32 = ((i.inner*16) + 7)
-                  compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 7)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+      preflattened_buffer_map = {placeholder_5: placeholder_15: Buffer(placeholder_10, float32, [128, 256], []), placeholder_8: placeholder_16: Buffer(placeholder_13, int32, [33], []), placeholder_7: placeholder_17: Buffer(placeholder_12, int32, [4916], []), placeholder_6: placeholder_18: Buffer(placeholder_11, float32, [4916, 16, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
+      for (i0.outer.i1.outer.fused: int32, 0, 16) "parallel" {
+        allocate(compute_4: Pointer(global float32), float32, [4096]), storage_scope = global {
+          for (i.outer.inner: int32, 0, 8) {
+            for (nb_j.inner: int32, 0, 2) {
+              for (i.inner.init: int32, 0, 16) {
+                for (j.init: int32, 0, 16) {
+                  compute_5: Buffer(compute_4, float32, [4096], [])[((((i.outer.inner*512) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
                 }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_12: int32 = ((i.inner*16) + 8)
-                  compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 8)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_13: int32 = ((i.inner*16) + 9)
-                  compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 9)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_14: int32 = ((i.inner*16) + 10)
-                  compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 10)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_15: int32 = ((i.inner*16) + 11)
-                  compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 11)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_16: int32 = ((i.inner*16) + 12)
-                  compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 12)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_17: int32 = ((i.inner*16) + 13)
-                  compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 13)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_18: int32 = ((i.inner*16) + 14)
-                  compute_5[cse_var_18] = (compute_5[cse_var_18] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 14)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-                }
-                if @tir.likely((elem_idx < (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-                  let cse_var_19: int32 = ((i.inner*16) + 15)
-                  compute_5[cse_var_19] = (compute_5[cse_var_19] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 15)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+              }
+              for (elem_idx: int32, 0, let cse_var_1: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
+                for (i.inner: int32, 0, 16) {
+                  for (j: int32, 0, 16) {
+                    let cse_var_3: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
+                    let cse_var_2: int32 = ((((i.outer.inner*512) + (i.inner*32)) + (nb_j.inner*16)) + j)
+                    compute_5[cse_var_2] = (compute_5[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[(((i.outer.inner*4096) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+                  }
                 }
               }
             }
           }
-          for (i0.inner: int32, 0, 64) {
-            let cse_var_20: int32 = (((floordiv(i0.outer.i1.outer.fused, 32)*32768) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 32)*16))
-            compute[ramp(cse_var_20, 1, 16)] = max((compute_5[ramp((i0.inner*16), 1, 16)] + placeholder_4[ramp(cse_var_20, 1, 16)]), broadcast(0f32, 16))
+          for (i0.inner: int32, 0, 128) {
+            for (i1.inner: int32, 0, 32) {
+              let cse_var_4: int32 = (((i0.inner*512) + (i0.outer.i1.outer.fused*32)) + i1.inner)
+              compute[cse_var_4] = max((compute_5[((i0.inner*32) + i1.inner)] + placeholder_4[cse_var_4]), 0f32)
+            }
           }
         }
       }
@@ -550,7 +478,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 1.849 ms
+    Execution time of this operator: 1.517 ms
 
 
 
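The "Execution time of this operator" figure comes from TVM's
``time_evaluator``. A self-contained sketch of the measurement mechanics
(a vector-add stands in for the tutorial's sparse workload):

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import te

    # Build a trivial schedule, then time it the way the tutorials do.
    n = 1 << 20
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
    func = tvm.build(te.create_schedule(C.op), [A, B, C], target="llvm")

    dev = tvm.cpu()
    a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), dev)
    b = tvm.nd.array(np.random.uniform(size=n).astype("float32"), dev)
    c = tvm.nd.empty((n,), "float32", dev)

    # repeat/min_repeat_ms smooth out timer noise; the median is reported.
    evaluator = func.time_evaluator(func.entry_name, dev, repeat=10, min_repeat_ms=100)
    print("Execution time of this operator: %.3f ms" % (np.median(evaluator(a, b, c).results) * 1000))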
diff --git a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
index 4b0b2a280..2dd267bba 100644
--- a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:43.518** total execution time for **how_to_tune_with_autotvm** files:
+**00:44.271** total execution time for **how_to_tune_with_autotvm** files:
 
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:43.483 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:44.240 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.021 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.016 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)             | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
index 940b2f6be..e01929833 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
@@ -892,8 +892,8 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 4, 32]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2885496
-    No: 6   GFLOPS: 42.43/42.43     result: MeasureResult(costs=(0.005456700842105263,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6141278743743896, timestamp=1656635456.932159)        [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
-    No: 7   GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 6   GFLOPS: 109.70/109.70   result: MeasureResult(costs=(0.0021103945208333333,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6624927520751953, timestamp=1656695361.7349436)      [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
+    No: 7   GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1016,7 +1016,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 16, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6225319
-    No: 8   GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 8   GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1139,7 +1139,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 8, 64]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,943546
-    No: 9   GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 9   GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1262,7 +1262,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 16, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2868708
-    No: 10  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 10  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
         res = future.result()
       File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
@@ -1280,7 +1280,7 @@ for this template
     TimeoutError
 
             [('tile_f', [-1, 32, 2, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4691833
-    No: 11  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 11  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1403,7 +1403,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1042124
-    No: 12  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 12  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1526,7 +1526,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 32, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10013405
-    No: 13  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 13  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1649,7 +1649,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 8, 8, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6732082
-    No: 14  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 14  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1772,7 +1772,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 4, 32]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7536735
-    No: 15  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 15  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1895,7 +1895,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 128, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,482121
-    No: 16  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 16  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -2018,7 +2018,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 16]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2824525
-    No: 17  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 17  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -2141,7 +2141,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4559286
-    No: 18  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 18  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -2264,7 +2264,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 32, 16]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 512]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9677544
-    No: 19  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+    No: 19  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 738, in __call__
         yield remote, remote.load_module(os.path.split(build_result.filename)[1])
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 702, in run_through_rpc
@@ -2352,7 +2352,7 @@ for this template
       15: _PyEval_EvalFrameDefault
       14: 0x0000000000537c30
       13: _PyObject_FastCallKeywords
-      12: 0x00007f03bd9f3fa2
+      12: 0x00007f937a5adfa2
       11: _ctypes_callproc
       10: ffi_call
       9: ffi_call_unix64
@@ -2417,7 +2417,7 @@ for this template
       21: _PyFunction_FastCallKeywords
       20: _PyEval_EvalFrameDefault
       19: _PyFunction_FastCall      [('tile_f', [-1, 8, 2, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6390073
-    No: 20  GFLOPS: 143.77/143.77   result: MeasureResult(costs=(0.00161018918,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4295034408569336, timestamp=1656635483.5630622)      [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
+    No: 20  GFLOPS: 143.57/143.57   result: MeasureResult(costs=(0.00161243103,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.437474250793457, timestamp=1656695388.2929761)       [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
 
 
 
@@ -2474,7 +2474,7 @@ and measure running time.
     Best config:
     [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
     Finish loading 20 records
-    Time cost of this operator: 0.002028
+    Time cost of this operator: 0.002060
 
 
 
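Both numbers above come from replaying the best of the 20 measured records. A minimal sketch, assuming the records were written to ``conv2d.log`` (a hypothetical path), of picking the fastest error-free entry by hand with AutoTVM's record utilities:

.. code-block:: python

    import numpy as np
    from tvm import autotvm

    # Each record pairs a MeasureInput (the config) with a MeasureResult (costs).
    records = list(autotvm.record.load_from_file("conv2d.log"))
    valid = [(inp, res) for inp, res in records if res.error_no == 0]
    best_inp, best_res = min(valid, key=lambda pair: np.mean(pair[1].costs))

    print("Best config:", best_inp.config)
    print("Mean cost (s):", float(np.mean(best_res.costs)))
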
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
index 60965512b..6bd9b0b72 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
@@ -329,10 +329,10 @@ Timing the untuned program
     ########## Build without Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  312.9     98.726   (1, 2, 10, 10, 3)  2       1        [312.9]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.074     0.97     (1, 6, 10, 10)     1       1        [3.074]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.963     0.304    (1, 1, 10, 10, 3)  1       1        [0.963]           
-    Total_time                                    -                                             316.937   -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  312.7     98.733   (1, 2, 10, 10, 3)  2       1        [312.7]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.036     0.959    (1, 6, 10, 10)     1       1        [3.036]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.977     0.309    (1, 1, 10, 10, 3)  1       1        [0.977]           
+    Total_time                                    -                                             316.713   -        -                  -       -        -                 
 
 
 
@@ -398,10 +398,10 @@ Timing the tuned program
     ########## Build with Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  79.25     96.609   (1, 6, 10, 10, 1)  2       1        [79.25]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.798     2.191    (1, 6, 10, 10)     1       1        [1.798]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.984     1.199    (1, 1, 10, 10, 3)  1       1        [0.984]           
-    Total_time                                    -                                             82.031    -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  189.5     98.416   (1, 1, 10, 10, 6)  2       1        [189.5]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       2.202     1.144    (1, 6, 10, 10)     1       1        [2.202]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.847     0.44     (1, 3, 10, 10, 1)  1       1        [0.847]           
+    Total_time                                    -                                             192.55    -        -                  -       -        -                 
 
 
 
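The Time(%) column in both tables is simply each operator's share of Total_time. Recomputing it for the untuned build above:

.. code-block:: python

    # Per-operator share of total runtime, from the Time(us) column above.
    times_us = {
        "fused_nn_contrib_conv2d_NCHWc": 312.7,
        "fused_layout_transform_1": 3.036,
        "fused_layout_transform": 0.977,
    }
    total = sum(times_us.values())  # 316.713 us, the printed Total_time
    for name, t in times_us.items():
        print(f"{name}: {100 * t / total:.3f}%")
    # -> 98.733%, 0.959%, 0.308% -- matching the table up to rounding of Time(us)
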
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
index e1b465bf7..0262b0236 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
@@ -225,7 +225,7 @@ take about **2 minutes** to download the Stanford Cars, while COCO 2017 validati
  .. code-block:: none
 
 
-    '/tmp/tmpdii8xw_t/images/random'
+    '/tmp/tmplvgsgyrm/images/random'
 
 
 
@@ -325,8 +325,8 @@ objects to other stuff? We can display some examples from our datasets using ``m
 
  .. code-block:: none
 
-    /tmp/tmpdii8xw_t/images/target contains 8144 images
-    /tmp/tmpdii8xw_t/images/random contains 5000 images
+    /tmp/tmplvgsgyrm/images/target contains 8144 images
+    /tmp/tmplvgsgyrm/images/random contains 5000 images
 
 
 
@@ -501,13 +501,13 @@ the time on our validation set).
  .. code-block:: none
 
     Epoch 1/3
-    328/328 - 55s - loss: 0.2267 - accuracy: 0.9217 - val_loss: 0.1547 - val_accuracy: 0.9520
+    328/328 - 55s - loss: 0.2185 - accuracy: 0.9259 - val_loss: 0.1491 - val_accuracy: 0.9588
     Epoch 2/3
-    328/328 - 52s - loss: 0.0968 - accuracy: 0.9640 - val_loss: 0.1113 - val_accuracy: 0.9622
+    328/328 - 52s - loss: 0.0952 - accuracy: 0.9644 - val_loss: 0.1322 - val_accuracy: 0.9585
     Epoch 3/3
-    328/328 - 52s - loss: 0.0648 - accuracy: 0.9764 - val_loss: 0.1232 - val_accuracy: 0.9634
+    328/328 - 53s - loss: 0.0676 - accuracy: 0.9751 - val_loss: 0.1259 - val_accuracy: 0.9603
 
-    <keras.callbacks.History object at 0x7f9561782710>
+    <keras.callbacks.History object at 0x7f74b3d048d0>
 
 
 
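The epoch lines above are ordinary Keras ``verbose=2`` output, and the final line is the repr of the returned ``History`` object. A tiny self-contained stand-in (random data and a toy model, not the tutorial's MobileNet transfer setup) that produces the same shape of log:

.. code-block:: python

    import numpy as np
    import tensorflow as tf

    # Toy data and model standing in for the image dataset used above.
    x = np.random.rand(64, 32).astype("float32")
    y = np.random.randint(0, 2, size=(64,))
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x, y, validation_split=0.25, epochs=3, verbose=2)
    print(history)  # <keras.callbacks.History object at 0x...>
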
@@ -864,7 +864,7 @@ Arduino tutorial for how to do that `on GitHub <https://github.com/guberti/tvm-a
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 4 minutes  58.237 seconds)
+   **Total running time of the script:** ( 4 minutes  50.228 seconds)
 
 
 .. _sphx_glr_download_how_to_work_with_microtvm_micro_train.py:
diff --git a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
index c05962ace..5b9a19414 100644
--- a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**05:44.793** total execution time for **how_to_work_with_microtvm** files:
+**05:36.577** total execution time for **how_to_work_with_microtvm** files:
 
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)               | 04:58.237 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)               | 04:50.228 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)         | 00:43.228 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)         | 00:43.073 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)             | 00:03.326 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)             | 00:03.274 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_microtvm_micro_ethosu.py` (``micro_ethosu.py``)             | 00:00.001 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
index a26e67e55..81aac0623 100644
--- a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:11.718** total execution time for **how_to_work_with_relay** files:
+**00:11.371** total execution time for **how_to_work_with_relay** files:
 
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``) | 00:09.999 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``) | 00:09.841 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)                   | 00:01.713 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)                   | 00:01.524 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_relay_using_relay_viz.py` (``using_relay_viz.py``)       | 00:00.006 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt b/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt
index 94abc2dc2..ed161e16e 100644
--- a/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt
@@ -261,7 +261,7 @@ The following example customizes the CUDA lowering rule for :code:`exp`.
  .. code-block:: none
 
 
-    <function my_cuda_math_rule at 0x7f94eb6f1cb0>
+    <function my_cuda_math_rule at 0x7f742b2438c0>
 
 
 
diff --git a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
index c088ccc94..3df1a17cb 100644
--- a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
@@ -5,22 +5,22 @@
 
 Computation times
 =================
-**00:04.294** total execution time for **how_to_work_with_schedules** files:
+**00:04.087** total execution time for **how_to_work_with_schedules** files:
 
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)                 | 00:01.966 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)                 | 00:01.891 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)                     | 00:01.075 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)                     | 00:00.976 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)                     | 00:00.550 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)                               | 00:00.526 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)                               | 00:00.529 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)                     | 00:00.520 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)                     | 00:00.099 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``) | 00:00.035 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)                               | 00:00.027 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)                               | 00:00.026 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)               | 00:00.014 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)               | 00:00.013 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
index d4d66323e..6eeae7260 100644
--- a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
@@ -347,7 +347,7 @@ The importing needs to happen before the tensorized GEMV is executed.
                  C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C}
       preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
-      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpe_ar4cnv/input0.cc'\nsource_filename = \"/tmp/tmpe_ar4cnv/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = alloca float*, align 8\n  %8 = alloca float*, align 8\n  %9 = alloca floa [...]
+      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpv2vuyfy1/input0.cc'\nsource_filename = \"/tmp/tmpv2vuyfy1/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = alloca float*, align 8\n  %8 = alloca float*, align 8\n  %9 = alloca floa [...]
       for (i, 0, 1024) {
         for (j.outer: int32, 0, 32) {
           @tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
diff --git a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
index 5685dc9b5..092fcb2de 100644
--- a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:21.206** total execution time for **topic_vta_tutorials_autotvm** files:
+**00:21.176** total execution time for **topic_vta_tutorials_autotvm** files:
 
 +---------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``) | 00:21.199 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``) | 00:21.169 | 0.0 MB |
 +---------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_alu_vta.py` (``tune_alu_vta.py``)     | 00:00.006 | 0.0 MB |
 +---------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
index 7fffdbef2..91217ab48 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
@@ -291,7 +291,7 @@ The compilation steps are:
       DeprecationWarning,
     /workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the  new recommended usage.
       relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
-    resnet18_v1 inference graph built in 22.13s!
+    resnet18_v1 inference graph built in 23.68s!
 
 
 
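The "built in ...s" line is a wall-clock measurement taken around ``relay.build``. A minimal sketch with a stand-in Relay program (the tutorial compiles resnet18_v1 for the VTA target; here a single conv2d on plain ``llvm`` keeps the example self-contained):

.. code-block:: python

    import time
    import tvm
    from tvm import relay

    # Tiny stand-in Relay program: one conv2d instead of resnet18_v1.
    x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
    w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
    func = relay.Function([x, w], relay.nn.conv2d(x, w, padding=(1, 1)))

    build_start = time.time()
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(tvm.IRModule.from_expr(func), target="llvm")
    print("inference graph built in {0:.2f}s!".format(time.time() - build_start))
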
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
index 70353d055..46822ff2d 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
@@ -335,7 +335,7 @@ The compilation steps are:
       "target_host parameter is going to be deprecated. "
     /workspace/python/tvm/relay/build_module.py:411: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
       DeprecationWarning,
-    yolov3-tiny inference graph built in 15.55s!
+    yolov3-tiny inference graph built in 16.58s!
 
 
 
diff --git a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
index bbef83ed1..707a05498 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**01:30.924** total execution time for **topic_vta_tutorials_frontend** files:
+**01:33.232** total execution time for **topic_vta_tutorials_frontend** files:
 
 +------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)           | 00:48.270 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)           | 00:49.348 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``) | 00:42.655 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``) | 00:43.884 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
index 24abfdbfe..d748f5c14 100644
--- a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:03.258** total execution time for **topic_vta_tutorials_optimize** files:
+**00:03.451** total execution time for **topic_vta_tutorials_optimize** files:
 
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)         | 00:02.838 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)         | 00:03.057 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``) | 00:00.420 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``) | 00:00.395 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
index 9223ce881..67047a018 100644
--- a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:00.759** total execution time for **topic_vta_tutorials** files:
+**00:00.694** total execution time for **topic_vta_tutorials** files:
 
 +---------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``) | 00:00.405 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``) | 00:00.373 | 0.0 MB |
 +---------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``) | 00:00.354 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``) | 00:00.320 | 0.0 MB |
 +---------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
index abb169d86..a2f810b95 100644
--- a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
+++ b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
@@ -205,13 +205,6 @@ trials, we can load the best schedule from the log file and apply it.
 
 
 
-.. rst-class:: sphx-glr-script-out
-
- .. code-block:: none
-
-
-    *E
-
 
 
 
@@ -335,7 +328,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 93.756 ms
+    Execution time of this operator: 93.448 ms
 
 
 
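The measurement above replays the best schedule found during the search. A minimal sketch, close to the tutorial's own workload definition, assuming the search wrote its records to ``matmul.json`` (hypothetical path):

.. code-block:: python

    import tvm
    from tvm import te, auto_scheduler

    @auto_scheduler.register_workload
    def matmul_add(N, L, M, dtype):
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        C = te.placeholder((N, M), name="C", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        matmul = te.compute(
            (N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="matmul"
        )
        out = te.compute((N, M), lambda i, j: matmul[i, j] + C[i, j], name="out")
        return [A, B, C, out]

    task = auto_scheduler.SearchTask(
        func=matmul_add, args=(1024, 1024, 1024, "float32"),
        target=tvm.target.Target("llvm"),
    )
    # Load the best schedule back from the (assumed) tuning log and lower it.
    sch, args = task.apply_best("matmul.json")
    print(tvm.lower(sch, args, simple_mode=True))
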
@@ -451,11 +444,6 @@ Expression (TE) language that demonstrates how TVM can optimize computational
 operations.
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  3.740 seconds)
-
-
 .. _sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py:
 
 .. only:: html
diff --git a/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt b/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt
index f83b560c2..ed7396c6e 100644
--- a/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt
+++ b/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt
@@ -462,16 +462,16 @@ reduce variance, we take 5 measurements and average them.
     waiting for device...
     device available
     Get devices for measurement successfully!
-    No: 1   GFLOPS: 10.52/10.52     result: MeasureResult(costs=(0.0255233368,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5395951271057129, timestamp=1656634327.0577953)       [('tile_y', [-1, 1]), ('tile_x', [-1, 256])],None,80
-    No: 2   GFLOPS: 2.91/10.52      result: MeasureResult(costs=(0.0921453436,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6114046573638916, timestamp=1656634328.6957347)       [('tile_y', [-1, 4]), ('tile_x', [-1, 8])],None,32
-    No: 3   GFLOPS: 11.81/11.81     result: MeasureResult(costs=(0.022731147,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5623428821563721, timestamp=1656634329.7482014)        [('tile_y', [-1, 64]), ('tile_x', [-1, 32])],None,56
-    No: 4   GFLOPS: 1.84/11.81      result: MeasureResult(costs=(0.1455941442,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.4478328227996826, timestamp=1656634332.24224) [('tile_y', [-1, 1]), ('tile_x', [-1, 4])],None,20
-    No: 5   GFLOPS: 3.65/11.81      result: MeasureResult(costs=(0.073621501,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.307786464691162, timestamp=1656634333.6774862) [('tile_y', [-1, 256]), ('tile_x', [-1, 16])],None,48
-    No: 6   GFLOPS: 1.75/11.81      result: MeasureResult(costs=(0.1537386102,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.578004837036133, timestamp=1656634336.824599) [('tile_y', [-1, 512]), ('tile_x', [-1, 4])],None,29
-    No: 7   GFLOPS: 0.87/11.81      result: MeasureResult(costs=(0.3075676484,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.0429277420043945, timestamp=1656634342.430558)        [('tile_y', [-1, 512]), ('tile_x', [-1, 2])],None,19
-    No: 8   GFLOPS: 10.71/11.81     result: MeasureResult(costs=(0.025061531400000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.542504072189331, timestamp=1656634342.995255) [('tile_y', [-1, 4]), ('tile_x', [-1, 64])],None,62
-    No: 9   GFLOPS: 1.88/11.81      result: MeasureResult(costs=(0.14242153419999998,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.3780105113983154, timestamp=1656634345.491731) [('tile_y', [-1, 2]), ('tile_x', [-1, 2])],None,11
-    No: 10  GFLOPS: 2.77/11.81      result: MeasureResult(costs=(0.09689358379999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6534321308135986, timestamp=1656634347.2038527)        [('tile_y', [-1, 4]), ('tile_x', [-1, 4])],None,22
+    No: 1   GFLOPS: 10.54/10.54     result: MeasureResult(costs=(0.025478070800000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5391578674316406, timestamp=1656694236.2053952)       [('tile_y', [-1, 1]), ('tile_x', [-1, 256])],None,80
+    No: 2   GFLOPS: 2.92/10.54      result: MeasureResult(costs=(0.09179072079999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6077759265899658, timestamp=1656694237.8423743)        [('tile_y', [-1, 4]), ('tile_x', [-1, 8])],None,32
+    No: 3   GFLOPS: 11.83/11.83     result: MeasureResult(costs=(0.0226969984,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5629618167877197, timestamp=1656694238.9021978)       [('tile_y', [-1, 64]), ('tile_x', [-1, 32])],None,56
+    No: 4   GFLOPS: 1.85/11.83      result: MeasureResult(costs=(0.1448239584,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.444626569747925, timestamp=1656694241.902186) [('tile_y', [-1, 1]), ('tile_x', [-1, 4])],None,20
+    No: 5   GFLOPS: 3.68/11.83      result: MeasureResult(costs=(0.07296285659999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.3114027976989746, timestamp=1656694243.3455317)        [('tile_y', [-1, 256]), ('tile_x', [-1, 16])],None,48
+    No: 6   GFLOPS: 1.84/11.83      result: MeasureResult(costs=(0.1456212594,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.498997926712036, timestamp=1656694245.885647) [('tile_y', [-1, 512]), ('tile_x', [-1, 4])],None,29
+    No: 7   GFLOPS: 0.85/11.83      result: MeasureResult(costs=(0.3165628946,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.185196876525879, timestamp=1656694251.6419764)        [('tile_y', [-1, 512]), ('tile_x', [-1, 2])],None,19
+    No: 8   GFLOPS: 10.60/11.83     result: MeasureResult(costs=(0.025324113199999998,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5541675090789795, timestamp=1656694252.2091339)       [('tile_y', [-1, 4]), ('tile_x', [-1, 64])],None,62
+    No: 9   GFLOPS: 1.91/11.83      result: MeasureResult(costs=(0.14073400200000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.350316047668457, timestamp=1656694254.6798334) [('tile_y', [-1, 2]), ('tile_x', [-1, 2])],None,11
+    No: 10  GFLOPS: 2.79/11.83      result: MeasureResult(costs=(0.0961036628,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6410527229309082, timestamp=1656694256.378456)        [('tile_y', [-1, 4]), ('tile_x', [-1, 4])],None,22
 
 
 
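Each record above pairs a measured cost with a (tile_y, tile_x) choice. A minimal sketch, close to the tutorial's template, of how those two knobs are declared and applied in an AutoTVM schedule template:

.. code-block:: python

    import tvm
    from tvm import te, autotvm

    @autotvm.template("tutorial/matmul")
    def matmul(N, L, M, dtype):
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        C = te.compute(
            (N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C"
        )
        s = te.create_schedule(C.op)
        y, x = s[C].op.axis

        cfg = autotvm.get_config()
        cfg.define_split("tile_y", y, num_outputs=2)  # the [-1, n] pairs above
        cfg.define_split("tile_x", x, num_outputs=2)
        yo, yi = cfg["tile_y"].apply(s, C, y)
        xo, xi = cfg["tile_x"].apply(s, C, x)
        s[C].reorder(yo, xo, k, yi, xi)
        return s, [A, B, C]
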
diff --git a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
index 42414192e..d07691540 100644
--- a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
+++ b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
@@ -327,7 +327,7 @@ standard deviation.
 
  .. code-block:: none
 
-    {'mean': 496.66298516000097, 'median': 496.44014619998416, 'std': 0.7022675180231087}
+    {'mean': 497.4941768000008, 'median': 497.148796700003, 'std': 1.2604263698723472}
 
 
 
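The dictionary above is plain NumPy statistics over repeated timing runs. A minimal sketch with fabricated timings (values illustrative only; the tutorial gets them from ``timeit`` around ``module.run``):

.. code-block:: python

    import numpy as np

    # Stand-in for ten repeated end-to-end timings, in milliseconds.
    timings_ms = np.array(
        [497.1, 496.9, 498.2, 497.5, 496.8, 499.3, 497.0, 497.3, 498.0, 496.8]
    )
    print({
        "mean": float(np.mean(timings_ms)),
        "median": float(np.median(timings_ms)),
        "std": float(np.std(timings_ms)),
    })
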
@@ -563,31 +563,31 @@ the tuning data to.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-
    [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  1/25]  Current/Best:   17.54/  17.54 GFLOPS | Progress: (4/20) | 6.76 s
    [Task  1/25]  Current/Best:    6.15/  17.54 GFLOPS | Progress: (8/20) | 9.23 s
    [Task  1/25]  Current/Best:   11.51/  22.87 GFLOPS | Progress: (12/20) | 11.69 s
    [Task  1/25]  Current/Best:   16.75/  22.87 GFLOPS | Progress: (16/20) | 13.37 s
    [Task  1/25]  Current/Best:   11.51/  23.47 GFLOPS | Progress: (20/20) | 15.16 s Done.
-
    [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  2/25]  Current/Best:   12.26/  12.88 GFLOPS | Progress: (4/20) | 3.94 s
    [Task  2/25]  Current/Best:   14.00/  17.60 GFLOPS | Progress: (8/20) | 5.27 s
    [Task  2/25]  Current/Best:   21.00/  21.00 GFLOPS | Progress: (12/20) | 6.61 s
    [Task  2/25]  Current/Best:   12.28/  21.00 GFLOPS | Progress: (16/20) | 7.86 s
    [Task  2/25]  Current/Best:   19.65/  21.00 GFLOPS | Progress: (20/20) | 9.50 s Done.
-
    [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  3/25]  Current/Best:    1.63/  10.52 GFLOPS | Progress: (4/20) | 5.85 s
    [Task  3/25]  Current/Best:   15.58/  16.87 GFLOPS | Progress: (8/20) | 7.76 s
    [Task  3/25]  Current/Best:   14.90/  16.87 GFLOPS | Progress: (12/20) | 9.47 s
    [Task  3/25]  Current/Best:    7.21/  23.80 GFLOPS | Progress: (16/20) | 11.38 s
    [Task  3/25]  Current/Best:   12.66/  23.80 GFLOPS | Progress: (20/20) | 15.97 s Done.
-
    [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  4/25]  Current/Best:    9.55/  20.42 GFLOPS | Progress: (4/20) | 2.38 s
    [Task  4/25]  Current/Best:    6.87/  20.42 GFLOPS | Progress: (8/20) | 7.19 s
    [Task  4/25]  Current/Best:   22.26/  22.26 GFLOPS | Progress: (12/20) | 12.23 s
    [Task  4/25]  Current/Best:   17.16/  22.26 GFLOPS | Progress: (16/20) | 14.64 s
    [Task  4/25]  Current/Best:   13.17/  22.26 GFLOPS | Progress: (20/20) | 16.62 s Done.
-
    [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  5/25]  Current/Best:    9.59/  10.32 GFLOPS | Progress: (4/20) | 2.59 s
    [Task  5/25]  Current/Best:   11.63/  11.96 GFLOPS | Progress: (8/20) | 4.67 s
    [Task  5/25]  Current/Best:   11.51/  18.04 GFLOPS | Progress: (12/20) | 7.90 s
    [Task  5/25]  Current/Best:   11.61/  22.46 GFLOPS | Progress: (16/20) | 9.36 s
    [Task  5/25]  Current/Best:   11.95/  22.46 GFLOPS | Progress: (20/20) | 11.25 s Done.
-
    [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  6/25]  Current/Best:   12.17/  20.66 GFLOPS | Progress: (4/20) | 4.11 s
    [Task  6/25]  Current/Best:   18.94/  20.66 GFLOPS | Progress: (8/20) | 5.87 s
    [Task  6/25]  Current/Best:   13.28/  20.66 GFLOPS | Progress: (12/20) | 7.83 s
    [Task  6/25]  Current/Best:   20.00/  20.66 GFLOPS | Progress: (16/20) | 10.06 s
    [Task  6/25]  Current/Best:    3.73/  20.66 GFLOPS | Progress: (20/20) | 12.58 s Done.
-
    [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  7/25]  Current/Best:   11.26/  12.72 GFLOPS | Progress: (4/20) | 3.54 s
    [Task  7/25]  Current/Best:   20.20/  21.01 GFLOPS | Progress: (8/20) | 5.07 s
    [Task  7/25]  Current/Best:   14.97/  21.01 GFLOPS | Progress: (12/20) | 6.99 s
    [Task  7/25]  Current/Best:   12.26/  21.01 GFLOPS | Progress: (16/20) | 9.03 s
    [Task  7/25]  Current/Best:    6.27/  21.70 GFLOPS | Progress: (20/20) | 11.49 s Done.
-
    [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  8/25]  Current/Best:    9.82/  14.35 GFLOPS | Progress: (4/20) | 2.93 s
    [Task  8/25]  Current/Best:    9.43/  14.35 GFLOPS | Progress: (8/20) | 8.11 s
    [Task  8/25]  Current/Best:   12.17/  14.35 GFLOPS | Progress: (12/20) | 14.70 s
    [Task  8/25]  Current/Best:   18.78/  18.78 GFLOPS | Progress: (16/20) | 16.83 s
    [Task  8/25]  Current/Best:   19.89/  19.89 GFLOPS | Progress: (20/20) | 24.03 s Done.
-
    [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  9/25]  Current/Best:   14.38/  15.87 GFLOPS | Progress: (4/20) | 11.97 s
    [Task  9/25]  Current/Best:   23.39/  23.39 GFLOPS | Progress: (8/20) | 13.81 s
    [Task  9/25]  Current/Best:    8.25/  23.39 GFLOPS | Progress: (12/20) | 16.39 s
    [Task  9/25]  Current/Best:   17.73/  23.39 GFLOPS | Progress: (16/20) | 19.30 s
    [Task  9/25]  Current/Best:    9.05/  23.39 GFLOPS | Progress: (20/20) | 27.97 s
    [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 10/25]  Current/Best:   18.25/  18.25 GFLOPS | Progress: (4/20) | 2.58 s
    [Task 10/25]  Current/Best:   15.46/  18.25 GFLOPS | Progress: (8/20) | 4.22 s
    [Task 10/25]  Current/Best:   12.43/  18.95 GFLOPS | Progress: (12/20) | 5.77 s
    [Task 10/25]  Current/Best:   19.22/  20.33 GFLOPS | Progress: (16/20) | 6.88 s
   [Task 10/25]  Current/Best:    8.97/  20.33 GFLOPS | Progress: (20/20) | 8.43 s Done.
-
    [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 11/25]  Current/Best:   12.24/  18.03 GFLOPS | Progress: (4/20) | 3.32 s
    [Task 11/25]  Current/Best:   16.90/  18.03 GFLOPS | Progress: (8/20) | 6.15 s
    [Task 11/25]  Current/Best:   18.10/  18.10 GFLOPS | Progress: (12/20) | 8.22 s
    [Task 11/25]  Current/Best:   13.25/  21.09 GFLOPS | Progress: (16/20) | 11.11 s
    [Task 11/25]  Current/Best:   19.44/  21.62 GFLOPS | Progress: (20/20) | 13.22 s Done.
-
    [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 12/25]  Current/Best:    7.79/  18.12 GFLOPS | Progress: (4/20) | 5.83 s
    [Task 12/25]  Current/Best:    5.28/  18.12 GFLOPS | Progress: (8/20) | 9.81 s
    [Task 12/25]  Current/Best:   18.91/  18.91 GFLOPS | Progress: (12/20) | 11.83 s
    [Task 12/25]  Current/Best:   15.39/  18.91 GFLOPS | Progress: (16/20) | 14.75 s
    [Task 12/25]  Current/Best:   15.19/  18.91 GFLOPS | Progress: (20/20) | 16.67 s Done.
-
    [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 13/25]  Current/Best:    8.69/  17.27 GFLOPS | Progress: (4/20) | 3.76 s
    [Task 13/25]  Current/Best:   16.02/  20.83 GFLOPS | Progress: (8/20) | 6.40 s
    [Task 13/25]  Current/Best:   19.62/  21.72 GFLOPS | Progress: (12/20) | 9.41 s
    [Task 13/25]  Current/Best:   12.28/  21.72 GFLOPS | Progress: (16/20) | 12.89 s
    [Task 13/25]  Current/Best:   18.76/  21.72 GFLOPS | Progress: (20/20) | 15.23 s Done.
-
    [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 14/25]  Current/Best:   13.54/  13.54 GFLOPS | Progress: (4/20) | 3.40 s
    [Task 14/25]  Current/Best:    6.09/  13.54 GFLOPS | Progress: (8/20) | 5.57 s
    [Task 14/25]  Current/Best:   21.29/  21.29 GFLOPS | Progress: (12/20) | 8.26 s
    [Task 14/25]  Current/Best:   17.03/  21.29 GFLOPS | Progress: (16/20) | 9.91 s Done.
-
    [Task 14/25]  Current/Best:   17.29/  21.29 GFLOPS | Progress: (20/20) | 11.67 s
    [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 15/25]  Current/Best:   16.09/  17.63 GFLOPS | Progress: (4/20) | 2.73 s
    [Task 15/25]  Current/Best:   14.47/  18.09 GFLOPS | Progress: (8/20) | 4.02 s
    [Task 15/25]  Current/Best:   10.39/  22.31 GFLOPS | Progress: (12/20) | 6.41 s
    [Task 15/25]  Current/Best:   20.40/  22.31 GFLOPS | Progress: (16/20) | 10.05 s
    [Task 15/25]  Current/Best:    9.65/  22.31 GFLOPS | Progress: (20/20) | 11.06 s
    [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 16/25]  Current/Best:   20.76/  20.76 GFLOPS | Progress: (4/20) | 2.94 s
    [Task 16/25]  Current/Best:    3.04/  20.76 GFLOPS | Progress: (8/20) | 4.54 s
    [Task 16/25]  Current/Best:   19.51/  20.76 GFLOPS | Progress: (12/20) | 5.75 s
   [Task 16/25]  Current/Best:   17.42/  20.76 GFLOPS | Progress: (16/20) | 7.10 s
    [Task 16/25]  Current/Best:    9.98/  21.27 GFLOPS | Progress: (20/20) | 9.25 s Done.
-
    [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 17/25]  Current/Best:   12.45/  18.68 GFLOPS | Progress: (4/20) | 4.79 s
    [Task 17/25]  Current/Best:   14.37/  23.33 GFLOPS | Progress: (8/20) | 7.68 s
    [Task 17/25]  Current/Best:   16.74/  23.33 GFLOPS | Progress: (12/20) | 9.73 s
    [Task 17/25]  Current/Best:   16.39/  23.33 GFLOPS | Progress: (16/20) | 11.94 s
    [Task 17/25]  Current/Best:   10.03/  23.33 GFLOPS | Progress: (20/20) | 14.11 s Done.
-
    [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 18/25]  Current/Best:   11.21/  17.92 GFLOPS | Progress: (4/20) | 3.81 s
    [Task 18/25]  Current/Best:   10.57/  19.62 GFLOPS | Progress: (8/20) | 7.51 s
    [Task 18/25]  Current/Best:   19.50/  19.62 GFLOPS | Progress: (12/20) | 9.43 s
    [Task 18/25]  Current/Best:   10.07/  19.62 GFLOPS | Progress: (16/20) | 13.35 s
    [Task 18/25]  Current/Best:   20.57/  20.57 GFLOPS | Progress: (20/20) | 14.84 s Done.
-
    [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 19/25]  Current/Best:    7.09/  20.38 GFLOPS | Progress: (4/20) | 6.09 s
    [Task 19/25]  Current/Best:    2.61/  20.38 GFLOPS | Progress: (8/20) | 9.38 s
    [Task 19/25]  Current/Best:   20.22/  21.80 GFLOPS | Progress: (12/20) | 12.38 s
    [Task 19/25]  Current/Best:   14.49/  21.80 GFLOPS | Progress: (16/20) | 15.42 s
    [Task 19/25]  Current/Best:    2.70/  23.65 GFLOPS | Progress: (20/20) | 18.23 s Done.
-
    [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 20/25]  Current/Best:    8.94/  15.17 GFLOPS | Progress: (4/20) | 3.32 s Done.
+
    [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  1/25]  Current/Best:   17.44/  17.44 GFLOPS | Progress: (4/20) | 6.38 s
    [Task  1/25]  Current/Best:    6.16/  17.44 GFLOPS | Progress: (8/20) | 9.30 s
    [Task  1/25]  Current/Best:   11.48/  22.69 GFLOPS | Progress: (12/20) | 11.76 s
    [Task  1/25]  Current/Best:   16.80/  22.83 GFLOPS | Progress: (16/20) | 13.45 s
    [Task  1/25]  Current/Best:   11.51/  23.85 GFLOPS | Progress: (20/20) | 15.19 s Done.
+
    [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  2/25]  Current/Best:   12.26/  13.12 GFLOPS | Progress: (4/20) | 3.81 s
    [Task  2/25]  Current/Best:   14.20/  18.42 GFLOPS | Progress: (8/20) | 5.14 s
    [Task  2/25]  Current/Best:   20.89/  20.89 GFLOPS | Progress: (12/20) | 6.49 s
    [Task  2/25]  Current/Best:   12.39/  20.89 GFLOPS | Progress: (16/20) | 7.75 s
    [Task  2/25]  Current/Best:   19.69/  20.89 GFLOPS | Progress: (20/20) | 9.34 s Done.
+
    [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  3/25]  Current/Best:    1.63/  10.50 GFLOPS | Progress: (4/20) | 5.86 s
    [Task  3/25]  Current/Best:   15.54/  16.71 GFLOPS | Progress: (8/20) | 7.81 s
    [Task  3/25]  Current/Best:   14.81/  16.71 GFLOPS | Progress: (12/20) | 9.56 s
    [Task  3/25]  Current/Best:    7.13/  23.78 GFLOPS | Progress: (16/20) | 11.49 s
    [Task  3/25]  Current/Best:   12.61/  23.78 GFLOPS | Progress: (20/20) | 16.01 s Done.
+
    [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  4/25]  Current/Best:    9.57/  20.37 GFLOPS | Progress: (4/20) | 2.43 s
    [Task  4/25]  Current/Best:    6.29/  20.37 GFLOPS | Progress: (8/20) | 6.85 s
    [Task  4/25]  Current/Best:   22.41/  22.41 GFLOPS | Progress: (12/20) | 11.47 s
    [Task  4/25]  Current/Best:   16.62/  22.41 GFLOPS | Progress: (16/20) | 13.74 s
    [Task  4/25]  Current/Best:   13.12/  22.41 GFLOPS | Progress: (20/20) | 15.75 s Done.
+
    [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  5/25]  Current/Best:    9.26/  10.13 GFLOPS | Progress: (4/20) | 2.65 s
    [Task  5/25]  Current/Best:   11.61/  12.69 GFLOPS | Progress: (8/20) | 4.77 s
    [Task  5/25]  Current/Best:   11.22/  18.04 GFLOPS | Progress: (12/20) | 7.87 s
    [Task  5/25]  Current/Best:   11.33/  22.53 GFLOPS | Progress: (16/20) | 9.29 s
    [Task  5/25]  Current/Best:   11.99/  22.53 GFLOPS | Progress: (20/20) | 11.23 s Done.
+
    [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  6/25]  Current/Best:   12.17/  20.69 GFLOPS | Progress: (4/20) | 3.95 s
    [Task  6/25]  Current/Best:   18.97/  20.69 GFLOPS | Progress: (8/20) | 5.72 s
    [Task  6/25]  Current/Best:   13.29/  20.69 GFLOPS | Progress: (12/20) | 7.66 s
    [Task  6/25]  Current/Best:   19.97/  20.69 GFLOPS | Progress: (16/20) | 9.95 s
    [Task  6/25]  Current/Best:    3.71/  20.69 GFLOPS | Progress: (20/20) | 12.49 s Done.
+
    [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  7/25]  Current/Best:   11.14/  12.81 GFLOPS | Progress: (4/20) | 3.67 s
    [Task  7/25]  Current/Best:   20.23/  20.96 GFLOPS | Progress: (8/20) | 5.18 s
    [Task  7/25]  Current/Best:   15.85/  20.96 GFLOPS | Progress: (12/20) | 7.08 s
    [Task  7/25]  Current/Best:   12.22/  20.96 GFLOPS | Progress: (16/20) | 9.13 s
    [Task  7/25]  Current/Best:    6.27/  21.69 GFLOPS | Progress: (20/20) | 11.61 s Done.
+
    [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  8/25]  Current/Best:    9.96/  14.04 GFLOPS | Progress: (4/20) | 2.91 s
    [Task  8/25]  Current/Best:    9.17/  14.04 GFLOPS | Progress: (8/20) | 7.75 s
    [Task  8/25]  Current/Best:   12.83/  14.04 GFLOPS | Progress: (12/20) | 14.00 s
    [Task  8/25]  Current/Best:   18.88/  18.88 GFLOPS | Progress: (16/20) | 16.06 s
    [Task  8/25]  Current/Best:   19.66/  19.66 GFLOPS | Progress: (20/20) | 22.73 s Done.
+
    [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  9/25]  Current/Best:   14.32/  15.59 GFLOPS | Progress: (4/20) | 11.97 s
    [Task  9/25]  Current/Best:   23.32/  23.32 GFLOPS | Progress: (8/20) | 13.74 s
    [Task  9/25]  Current/Best:    8.26/  23.32 GFLOPS | Progress: (12/20) | 16.08 s
    [Task  9/25]  Current/Best:   17.99/  23.32 GFLOPS | Progress: (16/20) | 18.79 s
    [Task  9/25]  Current/Best:    8.87/  23.32 GFLOPS | Progress: (20/20) | 26.56 s
    [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 10/25]  Current/Best:   18.20/  18.20 GFLOPS | Progress: (4/20) | 2.57 s
    [Task 10/25]  Current/Best:   15.33/  18.20 GFLOPS | Progress: (8/20) | 4.16 s
    [Task 10/25]  Current/Best:   12.50/  18.78 GFLOPS | Progress: (12/20) | 5.71 s
    [Task 10/25]  Current/Best:   19.02/  20.37 GFLOPS | Progress: (16/20) | 6.82 s
    [Task 10/25]  Current/Best:    8.84/  20.37 GFLOPS | Progress: (20/20) | 8.39 s Done.
+
    [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 11/25]  Current/Best:   12.26/  18.09 GFLOPS | Progress: (4/20) | 3.31 s
    [Task 11/25]  Current/Best:   16.84/  18.09 GFLOPS | Progress: (8/20) | 6.07 s
    [Task 11/25]  Current/Best:   17.87/  18.09 GFLOPS | Progress: (12/20) | 8.09 s
    [Task 11/25]  Current/Best:   13.40/  21.17 GFLOPS | Progress: (16/20) | 10.81 s
    [Task 11/25]  Current/Best:   19.43/  21.55 GFLOPS | Progress: (20/20) | 12.82 s Done.
+
    [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 12/25]  Current/Best:    7.80/  17.95 GFLOPS | Progress: (4/20) | 5.43 s
    [Task 12/25]  Current/Best:    5.19/  17.95 GFLOPS | Progress: (8/20) | 9.16 s
    [Task 12/25]  Current/Best:   18.83/  18.83 GFLOPS | Progress: (12/20) | 11.19 s
    [Task 12/25]  Current/Best:   15.29/  18.83 GFLOPS | Progress: (16/20) | 14.03 s
    [Task 12/25]  Current/Best:   15.11/  18.83 GFLOPS | Progress: (20/20) | 15.96 s Done.
+
    [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 13/25]  Current/Best:    8.78/  17.26 GFLOPS | Progress: (4/20) | 3.73 s
    [Task 13/25]  Current/Best:   16.02/  20.76 GFLOPS | Progress: (8/20) | 6.20 s
    [Task 13/25]  Current/Best:   19.50/  21.23 GFLOPS | Progress: (12/20) | 9.09 s
    [Task 13/25]  Current/Best:   12.25/  21.23 GFLOPS | Progress: (16/20) | 12.54 s
    [Task 13/25]  Current/Best:   18.70/  21.23 GFLOPS | Progress: (20/20) | 14.77 s Done.
+
    [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 14/25]  Current/Best:   12.27/  13.29 GFLOPS | Progress: (4/20) | 3.41 s
    [Task 14/25]  Current/Best:    6.11/  13.30 GFLOPS | Progress: (8/20) | 5.61 s
    [Task 14/25]  Current/Best:   20.84/  20.84 GFLOPS | Progress: (12/20) | 8.14 s
    [Task 14/25]  Current/Best:   16.53/  20.84 GFLOPS | Progress: (16/20) | 9.78 s Done.
+
    [Task 14/25]  Current/Best:   17.33/  20.84 GFLOPS | Progress: (20/20) | 11.50 s
    [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 15/25]  Current/Best:   16.14/  17.67 GFLOPS | Progress: (4/20) | 2.72 s
    [Task 15/25]  Current/Best:   13.31/  18.09 GFLOPS | Progress: (8/20) | 4.02 s
    [Task 15/25]  Current/Best:   10.42/  22.32 GFLOPS | Progress: (12/20) | 6.09 s
    [Task 15/25]  Current/Best:   20.40/  22.32 GFLOPS | Progress: (16/20) | 9.44 s
    [Task 15/25]  Current/Best:    9.71/  22.32 GFLOPS | Progress: (20/20) | 10.46 s
    [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 16/25]  Current/Best:   20.68/  20.68 GFLOPS | Progress: (4/20) | 2.93 s
    [Task 16/25]  Current/Best:    3.04/  20.68 GFLOPS | Progress: (8/20) | 4.55 s
    [Task 16/25]  Current/Best:   19.83/  20.68 GFLOPS | Progress: (12/20) | 5.76 s
    [Task 16/25]  Current/Best:   18.23/  20.68 GFLOPS | Progress: (16/20) | 7.13 s
    [Task 16/25]  Current/Best:    9.95/  22.56 GFLOPS | Progress: (20/20) | 9.16 s Done.
+
    [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 17/25]  Current/Best:   12.86/  18.66 GFLOPS | Progress: (4/20) | 4.73 s
    [Task 17/25]  Current/Best:   14.41/  23.28 GFLOPS | Progress: (8/20) | 7.52 s
    [Task 17/25]  Current/Best:   16.88/  23.28 GFLOPS | Progress: (12/20) | 9.57 s
    [Task 17/25]  Current/Best:   16.51/  23.28 GFLOPS | Progress: (16/20) | 11.71 s
    [Task 17/25]  Current/Best:   10.05/  23.28 GFLOPS | Progress: (20/20) | 13.84 s Done.
+
    [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 18/25]  Current/Best:   11.23/  17.92 GFLOPS | Progress: (4/20) | 3.74 s
    [Task 18/25]  Current/Best:   10.54/  18.68 GFLOPS | Progress: (8/20) | 7.15 s
    [Task 18/25]  Current/Best:   19.47/  19.47 GFLOPS | Progress: (12/20) | 9.08 s
    [Task 18/25]  Current/Best:   10.07/  19.47 GFLOPS | Progress: (16/20) | 12.71 s
    [Task 18/25]  Current/Best:   20.56/  20.56 GFLOPS | Progress: (20/20) | 14.21 s Done.
+
    [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 19/25]  Current/Best:    7.21/  20.33 GFLOPS | Progress: (4/20) | 6.02 s
    [Task 19/25]  Current/Best:    2.61/  20.33 GFLOPS | Progress: (8/20) | 9.27 s
    [Task 19/25]  Current/Best:   19.61/  21.39 GFLOPS | Progress: (12/20) | 12.11 s
    [Task 19/25]  Current/Best:   15.55/  21.39 GFLOPS | Progress: (16/20) | 15.01 s
    [Task 19/25]  Current/Best:    2.70/  23.59 GFLOPS | Progress: (20/20) | 17.78 s Done.
+
    [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 20/25]  Current/Best:    9.39/  15.17 GFLOPS | Progress: (4/20) | 3.30 s Done.
      Done.
-
    [Task 20/25]  Current/Best:    9.75/  15.17 GFLOPS | Progress: (8/20) | 6.73 s
    [Task 20/25]  Current/Best:    2.32/  15.17 GFLOPS | Progress: (12/20) | 10.65 s
    [Task 20/25]  Current/Best:   12.56/  15.17 GFLOPS | Progress: (16/20) | 14.58 s
    [Task 20/25]  Current/Best:   12.37/  21.97 GFLOPS | Progress: (20/20) | 16.68 s
    [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 21/25]  Current/Best:    6.41/  17.74 GFLOPS | Progress: (4/20) | 3.28 s
    [Task 21/25]  Current/Best:   14.61/  17.74 GFLOPS | Progress: (8/20) | 4.86 s
    [Task 21/25]  Current/Best:    1.61/  17.74 GFLOPS | Progress: (12/20) | 6.99 s
    [Task 21/25]  Current/Best:   17.89/  17.89 GFLOPS | Progress: (16/20) | 10.51 s
    [Task 21/25]  Current/Best:    4.47/  17.89 GFLOPS | Progress: (20/20) | 17.84 s
    [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 22/25]  Current/Best:    2.70/  16.97 GFLOPS | Progress: (4/20) | 2.66 s
    [Task 22/25]  Current/Best:    8.64/  21.96 GFLOPS | Progress: (8/20) | 4.65 s
    [Task 22/25]  Current/Best:   20.05/  21.96 GFLOPS | Progress: (12/20) | 7.05 s
    [Task 22/25]  Current/Best:   15.36/  21.96 GFLOPS | Progress: (16/20) | 9.21 s
    [Task 22/25]  Current/Best:   14.24/  21.96 GFLOPS | Progress: (20/20) | 10.94 s Done.
-
    [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 23/25]  Current/Best:   17.59/  20.57 GFLOPS | Progress: (4/20) | 3.22 s
    [Task 23/25]  Current/Best:   14.69/  20.57 GFLOPS | Progress: (8/20) | 6.50 s
    [Task 23/25]  Current/Best:   20.95/  21.77 GFLOPS | Progress: (12/20) | 8.34 s
    [Task 23/25]  Current/Best:    6.36/  21.77 GFLOPS | Progress: (16/20) | 15.50 s
    [Task 23/25]  Current/Best:    7.89/  21.77 GFLOPS | Progress: (20/20) | 19.73 s Done.
-
    [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 24/25]  Current/Best:    8.37/   8.37 GFLOPS | Progress: (4/20) | 11.78 s
    [Task 24/25]  Current/Best:    2.08/   8.37 GFLOPS | Progress: (8/20) | 22.79 s
    [Task 24/25]  Current/Best:    4.33/   8.37 GFLOPS | Progress: (12/20) | 34.33 s Done.
+
    [Task 20/25]  Current/Best:    9.80/  15.17 GFLOPS | Progress: (8/20) | 6.72 s
    [Task 20/25]  Current/Best:    2.32/  16.57 GFLOPS | Progress: (12/20) | 10.70 s
    [Task 20/25]  Current/Best:   12.34/  16.57 GFLOPS | Progress: (16/20) | 14.49 s
    [Task 20/25]  Current/Best:   13.44/  22.10 GFLOPS | Progress: (20/20) | 16.60 s
    [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 21/25]  Current/Best:    6.39/  17.62 GFLOPS | Progress: (4/20) | 3.26 s
    [Task 21/25]  Current/Best:   14.66/  17.62 GFLOPS | Progress: (8/20) | 4.84 s
    [Task 21/25]  Current/Best:    1.61/  17.62 GFLOPS | Progress: (12/20) | 6.98 s
    [Task 21/25]  Current/Best:   18.17/  18.17 GFLOPS | Progress: (16/20) | 10.42 s
    [Task 21/25]  Current/Best:    4.47/  18.17 GFLOPS | Progress: (20/20) | 17.56 s
    [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 22/25]  Current/Best:    2.70/  16.97 GFLOPS | Progress: (4/20) | 2.66 s
    [Task 22/25]  Current/Best:    8.66/  21.89 GFLOPS | Progress: (8/20) | 4.58 s
    [Task 22/25]  Current/Best:   20.03/  21.89 GFLOPS | Progress: (12/20) | 6.90 s
    [Task 22/25]  Current/Best:   15.55/  21.89 GFLOPS | Progress: (16/20) | 8.96 s
    [Task 22/25]  Current/Best:   14.12/  21.89 GFLOPS | Progress: (20/20) | 10.63 s Done.
+
    [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 23/25]  Current/Best:   17.69/  20.83 GFLOPS | Progress: (4/20) | 3.23 s
    [Task 23/25]  Current/Best:   14.09/  20.83 GFLOPS | Progress: (8/20) | 6.53 s
    [Task 23/25]  Current/Best:   20.97/  21.68 GFLOPS | Progress: (12/20) | 8.35 s
    [Task 23/25]  Current/Best:    6.44/  21.68 GFLOPS | Progress: (16/20) | 15.41 s
    [Task 23/25]  Current/Best:    7.91/  21.68 GFLOPS | Progress: (20/20) | 19.58 s Done.
+
    [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 24/25]  Current/Best:    8.25/   8.25 GFLOPS | Progress: (4/20) | 11.77 s
    [Task 24/25]  Current/Best:    3.48/   8.25 GFLOPS | Progress: (8/20) | 23.02 s
    [Task 24/25]  Current/Best:    4.22/   8.25 GFLOPS | Progress: (12/20) | 33.74 s Done.
      Done.
-
    [Task 24/25]  Current/Best:    6.02/   8.74 GFLOPS | Progress: (16/20) | 40.25 s
    [Task 24/25]  Current/Best:    3.31/   9.15 GFLOPS | Progress: (20/20) | 46.22 s Done.
-
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 25/25]  Current/Best:    1.55/   2.77 GFLOPS | Progress: (4/20) | 11.59 s
    [Task 25/25]  Current/Best:    5.81/   7.65 GFLOPS | Progress: (8/20) | 22.88 s
    [Task 25/25]  Current/Best:    5.90/   7.65 GFLOPS | Progress: (12/20) | 34.17 s
    [Task 25/25]  Current/Best:    5.84/   8.38 GFLOPS | Progress: (16/20) | 35.99 s
    [Task 25/25]  Current/Best:    2.88/   8.59 GFLOPS | Progress: (20/20) | 46.65 s
+
    [Task 24/25]  Current/Best:    6.06/   8.88 GFLOPS | Progress: (16/20) | 39.13 s
    [Task 24/25]  Current/Best:    3.34/   8.88 GFLOPS | Progress: (20/20) | 45.02 s Done.
+
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 25/25]  Current/Best:    1.55/   2.87 GFLOPS | Progress: (4/20) | 11.60 s
    [Task 25/25]  Current/Best:    5.73/   7.90 GFLOPS | Progress: (8/20) | 22.90 s
    [Task 25/25]  Current/Best:    5.96/   7.90 GFLOPS | Progress: (12/20) | 34.16 s
    [Task 25/25]  Current/Best:    5.80/   8.57 GFLOPS | Progress: (16/20) | 35.94 s
    [Task 25/25]  Current/Best:    2.86/   9.21 GFLOPS | Progress: (20/20) | 46.61 s
 
 
 
@@ -748,8 +748,8 @@ improvement in comparing the optimized model to the unoptimized model.
 
  .. code-block:: none
 
-    optimized: {'mean': 412.61378658001377, 'median': 412.78359264999835, 'std': 0.6393139363011658}
-    unoptimized: {'mean': 496.66298516000097, 'median': 496.44014619998416, 'std': 0.7022675180231087}
+    optimized: {'mean': 408.9233894200015, 'median': 408.743179150008, 'std': 0.6943345236249036}
+    unoptimized: {'mean': 497.4941768000008, 'median': 497.148796700003, 'std': 1.2604263698723472}
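
A quick way to read these numbers is as a speedup ratio. The sketch below
(plain Python, using the mean latencies from this run) is illustrative and
not part of the tutorial output:

.. code-block:: python

    # Mean latencies in milliseconds, taken from the run above.
    optimized_ms = 408.92
    unoptimized_ms = 497.49

    speedup = unoptimized_ms / optimized_ms          # ~1.22x
    reduction = 1.0 - optimized_ms / unoptimized_ms  # ~17.8% less time per inference
    print("speedup: %.2fx, latency reduction: %.1f%%" % (speedup, reduction * 100))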
 
 
 
@@ -772,7 +772,7 @@ profiling/benchmarking.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 10 minutes  25.178 seconds)
+   **Total running time of the script:** ( 10 minutes  17.460 seconds)
 
 
 .. _sphx_glr_download_tutorial_autotvm_relay_x86.py:
diff --git a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
index 923421641..222eecef3 100644
--- a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
+++ b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
@@ -282,7 +282,7 @@ device and returns the measured cost. Network overhead is excluded.
 
  .. code-block:: none
 
-    1.373e-07 secs/op
+    1.31e-07 secs/op
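
The figure above comes from TVM's ``time_evaluator`` running over RPC. A
minimal sketch of that measurement, assuming a remote device at a placeholder
address and a module ``lib.tar`` cross-compiled and uploaded earlier in the
tutorial:

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import rpc

    # Placeholder address; substitute your device's IP and port.
    remote = rpc.connect("10.77.1.162", 9090)
    func = remote.load_module("lib.tar")
    dev = remote.cpu()

    a = tvm.nd.array(np.random.uniform(size=1024).astype("float32"), dev)
    b = tvm.nd.array(np.zeros(1024, dtype="float32"), dev)

    # time_evaluator measures the on-device cost only, which is why network
    # overhead is excluded from the secs/op figure reported above.
    time_f = func.time_evaluator(func.entry_name, dev, number=10)
    print("%g secs/op" % time_f(a, b).mean)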
 
 
 
diff --git a/docs/_sources/tutorial/intro_topi.rst.txt b/docs/_sources/tutorial/intro_topi.rst.txt
index d52dcefa1..5f127c051 100644
--- a/docs/_sources/tutorial/intro_topi.rst.txt
+++ b/docs/_sources/tutorial/intro_topi.rst.txt
@@ -263,7 +263,7 @@ As you can see, scheduled stages of computation have been accumulated and we can
 
  .. code-block:: none
 
-    [stage(a, placeholder(a, 0x10ceaad0)), stage(b, placeholder(b, 0x56ccdf0)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min [...]
+    [stage(a, placeholder(a, 0x1176ff40)), stage(b, placeholder(b, 0x229c9870)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(mi [...]
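
For reference, a minimal sketch of the computation whose stages are listed
above, assuming the same shapes; ``topi`` applies the broadcasting rules
automatically:

.. code-block:: python

    import tvm
    from tvm import te, topi

    a = te.placeholder((100, 10, 10), name="a")
    b = te.placeholder((10, 10), name="b")
    c = topi.add(a, b)        # T_add: b broadcast over a's leading axis
    d = topi.multiply(a, b)   # T_multiply: same broadcast pattern
    s = te.create_schedule([c.op, d.op])
    print(s.stages)           # the accumulated stages, as printed above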
 
 
 
diff --git a/docs/_sources/tutorial/sg_execution_times.rst.txt b/docs/_sources/tutorial/sg_execution_times.rst.txt
index e0e79f1cc..f0eee7c21 100644
--- a/docs/_sources/tutorial/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorial/sg_execution_times.rst.txt
@@ -5,30 +5,30 @@
 
 Computation times
 =================
-**13:24.222** total execution time for **tutorial** files:
+**13:09.209** total execution time for **tutorial** files:
 
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)                 | 10:25.178 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)                 | 10:17.460 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``) | 01:03.740 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)     | 01:00.292 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)     | 01:02.036 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``) | 00:58.137 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)                 | 00:27.991 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)                 | 00:27.733 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)               | 00:23.609 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)               | 00:23.625 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)       | 00:00.810 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)       | 00:01.132 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)                               | 00:00.693 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)                               | 00:00.685 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``) | 00:00.157 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``) | 00:00.139 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)                           | 00:00.004 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)                           | 00:00.005 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorial_tvmc_python.py` (``tvmc_python.py``)                             | 00:00.001 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)   | 00:00.001 | 0.0 MB |
-+------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorial_install.py` (``install.py``)                                     | 00:00.001 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)   | 00:00.001 | 0.0 MB |
++------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
index 8fb9ad934..0f7394df8 100644
--- a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
+++ b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
@@ -302,7 +302,7 @@ helper function to run a profile of the TVM generated code.
  .. code-block:: none
 
     Numpy running time: 0.000008
-    naive: 0.000008
+    naive: 0.000006
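
Both figures are produced by a small profiling helper. A sketch of the idea,
assuming a compiled vector-add ``func`` and a ``target`` string from earlier
in the tutorial (the helper name is illustrative):

.. code-block:: python

    import numpy as np
    import tvm

    def evaluate_addition(func, target, name, a_np, b_np):
        # Run the compiled function via time_evaluator and report the mean.
        dev = tvm.device(target, 0)
        a = tvm.nd.array(a_np, dev)
        b = tvm.nd.array(b_np, dev)
        c = tvm.nd.array(np.zeros_like(a_np), dev)
        evaluator = func.time_evaluator(func.entry_name, dev, number=10)
        print("%s: %f" % (name, evaluator(a, b, c).mean))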
 
 
 
@@ -403,7 +403,7 @@ compile and run this new schedule with the parallel operation applied:
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    parallel: 0.000008
+    parallel: 0.000006
 
 
 
@@ -512,10 +512,10 @@ We can now compare the different schedules
  .. code-block:: none
 
                 Operator                  Timing             Performance
-                   numpy    7.98896000105742e-06                     1.0
-                   naive               7.532e-06      0.9428010653455597
-                parallel              7.9626e-06      0.9967004464844071
-                  vector    2.4634800000000003e-05    3.0836053750099306
+                   numpy    8.199280000553699e-06                    1.0
+                   naive              5.8584e-06      0.7145017610819951
+                parallel    6.0600000000000004e-06    0.7390892858386063
+                  vector             2.46396e-05       3.005093129925565
 
 
 
@@ -936,7 +936,7 @@ matrix multiplication.
 
  .. code-block:: none
 
-    Numpy running time: 0.018837
+    Numpy running time: 0.017751
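
The baseline above is just ``numpy.dot`` under ``timeit``; a self-contained
sketch of the same measurement:

.. code-block:: python

    import timeit
    import numpy as np

    M = K = N = 1024
    a = np.random.rand(M, K).astype("float32")
    b = np.random.rand(K, N).astype("float32")
    t = timeit.timeit(lambda: np.dot(a, b), number=10) / 10
    print("Numpy running time: %f" % t)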
 
 
 
@@ -996,7 +996,7 @@ optimizations.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    none: 3.508006
+    none: 3.416307
 
 
 
@@ -1101,7 +1101,7 @@ schedule.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    blocking: 0.293005
+    blocking: 0.279631
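
The blocking step tiles the output so each 32x32 block of ``C`` stays in
cache while its reduction runs. A minimal sketch of that schedule (the block
size and split factor are the tutorial's illustrative choices):

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")

    s = te.create_schedule(C.op)
    bn = 32  # a 32x32 block of C fits comfortably in cache
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    ko, ki = s[C].split(k, factor=4)
    s[C].reorder(xo, yo, ko, ki, xi, yi)
    func = tvm.build(s, [A, B, C], target="llvm")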
 
 
 
@@ -1199,7 +1199,7 @@ already cache friendly from our previous optimizations.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    vectorization: 0.328433
+    vectorization: 0.318937
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1275,7 +1275,7 @@ more cache friendly.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    loop permutation: 0.119415
+    loop permutation: 0.116747
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
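
Vectorization and loop permutation both act on the blocked schedule sketched
earlier: the axes are reordered so the innermost loop walks contiguous
memory, and that loop is then vectorized. A sketch of the two primitives
together:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")

    s = te.create_schedule(C.op)
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    ko, ki = s[C].split(k, factor=4)
    s[C].reorder(xo, yo, ko, xi, ki, yi)  # loop permutation: xi hoisted above ki
    s[C].vectorize(yi)                    # SIMD over the unit-stride axis
    func = tvm.build(s, [A, B, C], target="llvm")
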
@@ -1376,7 +1376,7 @@ optimized schedule.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    array packing: 0.110499
+    array packing: 0.108995
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
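
Array packing rewrites ``B`` into an ``[N // bn][K][bn]`` layout so the block
consumed by each tile of ``C`` is contiguous in memory. A sketch of the
repacked compute:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")

    # Repack B so each bn-wide column block becomes a contiguous row.
    packedB = te.compute((N // bn, K, bn), lambda x, y, z: B[y, x * bn + z], name="packedB")
    C = te.compute(
        (M, N),
        lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)
    func = tvm.build(s, [A, B, C], target="llvm")
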
@@ -1471,7 +1471,7 @@ to `C` when all the block results are ready.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    block caching: 0.111033
+    block caching: 0.110385
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1559,7 +1559,7 @@ of thread-level parallelization.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    parallelization: 0.145281
+    parallelization: 0.142919
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
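
Block caching accumulates each tile in a local write buffer and flushes it to
``C`` once per block; parallelization then distributes the outer blocks
across threads. A combined sketch of the two steps:

.. code-block:: python

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")

    s = te.create_schedule(C.op)
    CC = s.cache_write(C, "global")  # accumulate each block in a write cache
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[CC].compute_at(s[C], yo)       # flush the cached block once it is ready
    xc, yc = s[CC].op.axis
    ko, ki = s[CC].split(CC.op.reduce_axis[0], factor=4)
    s[CC].reorder(ko, xc, ki, yc)
    s[C].parallel(xo)                # thread-level parallelism over outer blocks
    func = tvm.build(s, [A, B, C], target="llvm")
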
@@ -1640,13 +1640,13 @@ working, we can compare the results.
  .. code-block:: none
 
                 Operator                  Timing             Performance
-                    none            3.5080056702                     1.0
-                blocking            0.2930054273     0.08352478725705567
-           vectorization            0.3284332889     0.09362393330489549
-        loop permutation     0.11941494689999999    0.034040693809138524
-           array packing            0.1104993643    0.031499197746080086
-           block caching     0.11103318059999998    0.031651368623263856
-         parallelization            0.1452810395    0.041414140442856566
+                    none             3.416307389                     1.0
+                blocking            0.2796306091     0.08185171217331579
+           vectorization            0.3189368985      0.0933572018510188
+        loop permutation     0.11674748589999999      0.0341735893777912
+           array packing     0.10899485709999998     0.03190428866294267
+           block caching     0.11038474670000001     0.03231112840004457
+         parallelization     0.14291868489999998    0.041834258053059514
 
 
 
@@ -1688,7 +1688,7 @@ the computation for specific platforms.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  2.036 seconds)
+   **Total running time of the script:** ( 1 minutes  0.292 seconds)
 
 
 .. _sphx_glr_download_tutorial_tensor_expr_get_started.py:
diff --git a/docs/commit_hash b/docs/commit_hash
index b21d8604e..e033d4bcd 100644
--- a/docs/commit_hash
+++ b/docs/commit_hash
@@ -1 +1 @@
-beea0d2d6add545bd27130309aa12b8e7a38100f
+0ae3f5d6ce6f150dd038f59429ab1da4fadea177
diff --git a/docs/contribute/ci.html b/docs/contribute/ci.html
index 7a857982d..bf1d41423 100644
--- a/docs/contribute/ci.html
+++ b/docs/contribute/ci.html
@@ -237,6 +237,8 @@
 <li class="toctree-l4"><a class="reference internal" href="#procedures-for-keeping-ci-green">Procedures for Keeping CI Green</a></li>
 <li class="toctree-l4"><a class="reference internal" href="#dealing-with-flakiness">Dealing with Flakiness</a></li>
 <li class="toctree-l4"><a class="reference internal" href="#skipping-ci">Skipping CI</a></li>
+<li class="toctree-l4"><a class="reference internal" href="#docker-images">Docker Images</a></li>
+<li class="toctree-l4"><a class="reference internal" href="#ci-docker-staging"><code class="docutils literal notranslate"><span class="pre">ci-docker-staging</span></code></a></li>
 <li class="toctree-l4"><a class="reference internal" href="#ci-monitoring-rotation">CI Monitoring Rotation</a></li>
 </ul>
 </li>
@@ -354,27 +356,29 @@
 <span id="ci-guide"></span><h1>Using TVM’s CI<a class="headerlink" href="#using-tvm-s-ci" title="Permalink to this headline">¶</a></h1>
 <div class="contents local topic" id="contents">
 <ul class="simple">
-<li><p><a class="reference internal" href="#for-contributors" id="id2">For Contributors</a></p>
+<li><p><a class="reference internal" href="#for-contributors" id="id3">For Contributors</a></p>
 <ul>
-<li><p><a class="reference internal" href="#debugging-failures" id="id3">Debugging Failures</a></p>
+<li><p><a class="reference internal" href="#debugging-failures" id="id4">Debugging Failures</a></p>
 <ul>
-<li><p><a class="reference internal" href="#jenkins-logs" id="id4">Jenkins Logs</a></p></li>
-<li><p><a class="reference internal" href="#reproduce-failures" id="id5">Reproduce Failures</a></p></li>
+<li><p><a class="reference internal" href="#jenkins-logs" id="id5">Jenkins Logs</a></p></li>
+<li><p><a class="reference internal" href="#reproduce-failures" id="id6">Reproduce Failures</a></p></li>
 </ul>
 </li>
-<li><p><a class="reference internal" href="#reporting-issues" id="id6">Reporting Issues</a></p></li>
+<li><p><a class="reference internal" href="#reporting-issues" id="id7">Reporting Issues</a></p></li>
 </ul>
 </li>
-<li><p><a class="reference internal" href="#for-maintainers" id="id7">For Maintainers</a></p>
+<li><p><a class="reference internal" href="#for-maintainers" id="id8">For Maintainers</a></p>
 <ul>
-<li><p><a class="reference internal" href="#procedures-for-keeping-ci-green" id="id8">Procedures for Keeping CI Green</a></p>
+<li><p><a class="reference internal" href="#procedures-for-keeping-ci-green" id="id9">Procedures for Keeping CI Green</a></p>
 <ul>
-<li><p><a class="reference internal" href="#broken-ci-due-to-simultaneous-merge" id="id9">Broken CI due to Simultaneous Merge</a></p></li>
+<li><p><a class="reference internal" href="#broken-ci-due-to-simultaneous-merge" id="id10">Broken CI due to Simultaneous Merge</a></p></li>
 </ul>
 </li>
-<li><p><a class="reference internal" href="#dealing-with-flakiness" id="id10">Dealing with Flakiness</a></p></li>
-<li><p><a class="reference internal" href="#skipping-ci" id="id11">Skipping CI</a></p></li>
-<li><p><a class="reference internal" href="#ci-monitoring-rotation" id="id12">CI Monitoring Rotation</a></p></li>
+<li><p><a class="reference internal" href="#dealing-with-flakiness" id="id11">Dealing with Flakiness</a></p></li>
+<li><p><a class="reference internal" href="#skipping-ci" id="id12">Skipping CI</a></p></li>
+<li><p><a class="reference internal" href="#docker-images" id="id13">Docker Images</a></p></li>
+<li><p><a class="reference internal" href="#ci-docker-staging" id="id14"><code class="docutils literal notranslate"><span class="pre">ci-docker-staging</span></code></a></p></li>
+<li><p><a class="reference internal" href="#ci-monitoring-rotation" id="id15">CI Monitoring Rotation</a></p></li>
 </ul>
 </li>
 </ul>
@@ -386,19 +390,19 @@ build configuration specified in a <a class="reference external" href="https://g
 Jenkins is the only CI step that is codified to block merging. TVM is also tested minimally
against Windows and macOS using GitHub Actions.</p>
 <p>This page describes how contributors and committers can use TVM’s CI to verify their code. You can
-read more about the design of TVM CI in the</p>
+read more about the design of TVM CI in the <a class="reference external" href="https://github.com/tlc-pack/ci">tlc-pack/ci</a> repo.</p>
 <div class="section" id="for-contributors">
-<h2><a class="toc-backref" href="#id2">For Contributors</a><a class="headerlink" href="#for-contributors" title="Permalink to this headline">¶</a></h2>
+<h2><a class="toc-backref" href="#id3">For Contributors</a><a class="headerlink" href="#for-contributors" title="Permalink to this headline">¶</a></h2>
 <p>A standard CI run looks something like this viewed in <a class="reference external" href="https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/activity">Jenkins’ BlueOcean viewer</a>.
 CI runs usually take a couple hours to complete and pull requests (PRs) cannot be merged before CI
 has successfully completed. To diagnose failing steps, click through to the failing
 pipeline stage then to the failing step to see the output logs.</p>
 <a class="reference internal image-reference" href="https://github.com/tlc-pack/web-data/raw/main/images/contribute/ci.png"><img alt="The Jenkins UI for a CI run" src="https://github.com/tlc-pack/web-data/raw/main/images/contribute/ci.png" style="width: 800px;" /></a>
 <div class="section" id="debugging-failures">
-<h3><a class="toc-backref" href="#id3">Debugging Failures</a><a class="headerlink" href="#debugging-failures" title="Permalink to this headline">¶</a></h3>
+<h3><a class="toc-backref" href="#id4">Debugging Failures</a><a class="headerlink" href="#debugging-failures" title="Permalink to this headline">¶</a></h3>
 <p>When CI fails for some reason, there are several methods to diagnose the issue.</p>
 <div class="section" id="jenkins-logs">
-<h4><a class="toc-backref" href="#id4">Jenkins Logs</a><a class="headerlink" href="#jenkins-logs" title="Permalink to this headline">¶</a></h4>
+<h4><a class="toc-backref" href="#id5">Jenkins Logs</a><a class="headerlink" href="#jenkins-logs" title="Permalink to this headline">¶</a></h4>
 <p>The first place to look for a failure is in the CI logs, follow the red Xs on
 the failing job to view the logs. Note:</p>
 <ul class="simple">
@@ -409,24 +413,24 @@ need to scroll up to view the actual failure.</p></li>
 </ul>
 </div>
 <div class="section" id="reproduce-failures">
-<h4><a class="toc-backref" href="#id5">Reproduce Failures</a><a class="headerlink" href="#reproduce-failures" title="Permalink to this headline">¶</a></h4>
+<h4><a class="toc-backref" href="#id6">Reproduce Failures</a><a class="headerlink" href="#reproduce-failures" title="Permalink to this headline">¶</a></h4>
 <p>Most TVM Python tests run under <a class="reference external" href="https://docs.pytest.org/en/6.2.x/"><code class="docutils literal notranslate"><span class="pre">pytest</span></code></a> and can be run as described in <a class="reference internal" href="pull_request.html#pr-testing"><span class="std std-ref">Testing</span></a>.</p>
 </div>
 </div>
 <div class="section" id="reporting-issues">
-<h3><a class="toc-backref" href="#id6">Reporting Issues</a><a class="headerlink" href="#reporting-issues" title="Permalink to this headline">¶</a></h3>
+<h3><a class="toc-backref" href="#id7">Reporting Issues</a><a class="headerlink" href="#reporting-issues" title="Permalink to this headline">¶</a></h3>
 <p>Issues with CI should be <a class="reference external" href="https://github.com/apache/tvm/issues/new?assignees=&amp;labels=&amp;template=ci-problem.md&amp;title=%5BCI+Problem%5D+">reported on GitHub</a>
 with a link to the relevant jobs, commits, or PRs.</p>
 </div>
 </div>
 <div class="section" id="for-maintainers">
-<h2><a class="toc-backref" href="#id7">For Maintainers</a><a class="headerlink" href="#for-maintainers" title="Permalink to this headline">¶</a></h2>
+<h2><a class="toc-backref" href="#id8">For Maintainers</a><a class="headerlink" href="#for-maintainers" title="Permalink to this headline">¶</a></h2>
 <p>This section discusses processes run by TVM Maintainers.</p>
 <div class="section" id="procedures-for-keeping-ci-green">
-<h3><a class="toc-backref" href="#id8">Procedures for Keeping CI Green</a><a class="headerlink" href="#procedures-for-keeping-ci-green" title="Permalink to this headline">¶</a></h3>
+<h3><a class="toc-backref" href="#id9">Procedures for Keeping CI Green</a><a class="headerlink" href="#procedures-for-keeping-ci-green" title="Permalink to this headline">¶</a></h3>
 <p>This section talks about common procedures used to keep CI passing.</p>
 <div class="section" id="broken-ci-due-to-simultaneous-merge">
-<h4><a class="toc-backref" href="#id9">Broken CI due to Simultaneous Merge</a><a class="headerlink" href="#broken-ci-due-to-simultaneous-merge" title="Permalink to this headline">¶</a></h4>
+<h4><a class="toc-backref" href="#id10">Broken CI due to Simultaneous Merge</a><a class="headerlink" href="#broken-ci-due-to-simultaneous-merge" title="Permalink to this headline">¶</a></h4>
 <p>Developers rely on the TVM CI to get signal on their PRs before merging.  Occasionally, two
 different PRs can pass CI individually but break <code class="docutils literal notranslate"><span class="pre">main</span></code> when both land.  This in turn causes an
 error to show up on an unrelated PR that is based on the broken commit(s). Broken commits can be
@@ -442,7 +446,7 @@ author of the offending PR when that PR is large.</p>
 </div>
 </div>
 <div class="section" id="dealing-with-flakiness">
-<h3><a class="toc-backref" href="#id10">Dealing with Flakiness</a><a class="headerlink" href="#dealing-with-flakiness" title="Permalink to this headline">¶</a></h3>
+<h3><a class="toc-backref" href="#id11">Dealing with Flakiness</a><a class="headerlink" href="#dealing-with-flakiness" title="Permalink to this headline">¶</a></h3>
 <p>If you notice a failure on your PR that seems unrelated to your change, you should
 search <a class="reference external" href="https://github.com/apache/tvm/issues?q=is%3Aissue+%5BCI+Problem%5D+Flaky+">recent GitHub issues related to flaky tests</a> and
 <a class="reference external" href="https://github.com/apache/tvm/issues/new?assignees=&amp;labels=&amp;template=ci-problem.md&amp;title=%5BCI+Problem%5D+">file a new issue</a>
@@ -466,7 +470,7 @@ gh pr create
 </div>
 </div>
 <div class="section" id="skipping-ci">
-<h3><a class="toc-backref" href="#id11">Skipping CI</a><a class="headerlink" href="#skipping-ci" title="Permalink to this headline">¶</a></h3>
+<h3><a class="toc-backref" href="#id12">Skipping CI</a><a class="headerlink" href="#skipping-ci" title="Permalink to this headline">¶</a></h3>
 <p>For reverts and trivial forward fixes, adding <code class="docutils literal notranslate"><span class="pre">[skip</span> <span class="pre">ci]</span></code> to the revert’s
 PR title will cause CI to shortcut and only run lint. Committers should
 take care that they only merge CI-skipped PRs to fix a failure on <code class="docutils literal notranslate"><span class="pre">main</span></code> and
@@ -488,8 +492,29 @@ git push my_repo
 </pre></div>
 </div>
 </div>
+<div class="section" id="docker-images">
+<h3><a class="toc-backref" href="#id13">Docker Images</a><a class="headerlink" href="#docker-images" title="Permalink to this headline">¶</a></h3>
+<p>Each CI job runs most of its work inside a Docker container, built from files
+in the <a class="reference external" href="https://github.com/apache/tvm/tree/main/docker">docker/</a> folder. These
+images are built nightly in Jenkins via the <a class="reference external" href="https://ci.tlcpack.ai/job/docker-images-ci/">docker-images-ci</a> job.
+The images for these containers are hosted in the <a class="reference external" href="https://hub.docker.com/u/tlcpack">tlcpack Docker Hub</a>
+and referenced in the <a class="reference external" href="https://github.com/apache/tvm/tree/main/Jenkinsfile.j2">Jenkinsfile.j2</a>. These can be inspected and run
+locally via standard Docker commands.</p>
+</div>
+<div class="section" id="ci-docker-staging">
+<h3><a class="toc-backref" href="#id14"><code class="docutils literal notranslate"><span class="pre">ci-docker-staging</span></code></a><a class="headerlink" href="#ci-docker-staging" title="Permalink to this headline">¶</a></h3>
+<p>The <a class="reference external" href="https://github.com/apache/tvm/tree/ci-docker-staging">ci-docker-staging</a>
+branch is typically used to test updates to Docker images and <code class="docutils literal notranslate"><span class="pre">Jenkinsfile</span></code> changes. When
+running a build for a normal PR from a forked repository, Jenkins uses the code
+from the PR except for the <code class="docutils literal notranslate"><span class="pre">Jenkinsfile</span></code> itself, which comes from the base branch.
+When branches are built, the <code class="docutils literal notranslate"><span class="pre">Jenkinsfile</span></code> in the branch is used, so a committer
+with write access must push PRs to a branch in apache/tvm to properly test
+<code class="docutils literal notranslate"><span class="pre">Jenkinsfile</span></code> changes. If your PR makes changes to the <code class="docutils literal notranslate"><span class="pre">Jenkinsfile</span></code>, make sure
+to &#64; a <a class="reference external" href="https://github.com/apache/tvm/tree/main/CONTRIBUTORS.md">committer</a>
+and ask them to push your PR as a branch to test the changes.</p>
+</div>
 <div class="section" id="ci-monitoring-rotation">
-<h3><a class="toc-backref" href="#id12">CI Monitoring Rotation</a><a class="headerlink" href="#ci-monitoring-rotation" title="Permalink to this headline">¶</a></h3>
+<h3><a class="toc-backref" href="#id15">CI Monitoring Rotation</a><a class="headerlink" href="#ci-monitoring-rotation" title="Permalink to this headline">¶</a></h3>
 <p>Some tests are also flaky and occasionally fail for reasons unrelated to the PR. The
 <a class="reference external" href="https://github.com/apache/tvm/wiki/CI-Monitoring-Runbook">CI monitoring rotation</a> watches for these failures and
 disables tests as necessary. It is the responsibility of those who wrote the test to ultimately fix
diff --git a/docs/how_to/compile_models/from_darknet.html b/docs/how_to/compile_models/from_darknet.html
index e2a268078..7cb8d9415 100644
--- a/docs/how_to/compile_models/from_darknet.html
+++ b/docs/how_to/compile_models/from_darknet.html
@@ -569,7 +569,6 @@ class:[&#39;truck 0.9266&#39;] left:471 top:83 right:689 bottom:169
 class:[&#39;bicycle 0.9984&#39;] left:111 top:113 right:577 bottom:447
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  2.506 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-darknet-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_darknet.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/from_mxnet.html b/docs/how_to/compile_models/from_mxnet.html
index 7e37d2f91..8c52b26ad 100644
--- a/docs/how_to/compile_models/from_mxnet.html
+++ b/docs/how_to/compile_models/from_mxnet.html
@@ -422,7 +422,7 @@ to download the full example code</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;x&quot;</span><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#tuple" title="builtins.tuple" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">x</span><span class="o">.</span><span class="n">shape</span></a><span class="p">)</span>
 </pre></div>
 </div>
-<img src="../../_images/sphx_glr_from_mxnet_001.png" srcset="../../_images/sphx_glr_from_mxnet_001.png" alt="from mxnet" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip8a204fab-2757-45a3-a387-97883374e7a8 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+<img src="../../_images/sphx_glr_from_mxnet_001.png" srcset="../../_images/sphx_glr_from_mxnet_001.png" alt="from mxnet" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip1e8cea2c-47ab-462c-b3c7-021f6d7fcddc from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
 x (1, 3, 224, 224)
 </pre></div>
 </div>
diff --git a/docs/how_to/compile_models/from_oneflow.html b/docs/how_to/compile_models/from_oneflow.html
index 7abe78892..e59d793d1 100644
--- a/docs/how_to/compile_models/from_oneflow.html
+++ b/docs/how_to/compile_models/from_oneflow.html
@@ -427,12 +427,12 @@ python3 -m pip install -f https://release.oneflow.info <span class="nv">oneflow<
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip&quot; to /workspace/.oneflow/flowvision_cache/resnet18.zip
 
   0%|          | 0.00/41.5M [00:00&lt;?, ?B/s]
- 19%|#9        | 7.99M/41.5M [00:00&lt;00:00, 61.4MB/s]
- 39%|###8      | 16.0M/41.5M [00:00&lt;00:00, 58.7MB/s]
- 58%|#####7    | 24.0M/41.5M [00:00&lt;00:00, 53.9MB/s]
- 76%|#######6  | 31.7M/41.5M [00:00&lt;00:00, 56.5MB/s]
- 90%|########9 | 37.2M/41.5M [00:00&lt;00:00, 53.6MB/s]
-100%|##########| 41.5M/41.5M [00:00&lt;00:00, 50.4MB/s]
+ 19%|#9        | 7.99M/41.5M [00:00&lt;00:00, 52.5MB/s]
+ 39%|###8      | 16.0M/41.5M [00:00&lt;00:00, 53.4MB/s]
+ 54%|#####3    | 22.3M/41.5M [00:00&lt;00:00, 50.4MB/s]
+ 65%|######5   | 27.1M/41.5M [00:00&lt;00:00, 47.8MB/s]
+ 92%|#########2| 38.3M/41.5M [00:00&lt;00:00, 52.9MB/s]
+100%|##########| 41.5M/41.5M [00:00&lt;00:00, 53.0MB/s]
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/compile_models/from_pytorch.html b/docs/how_to/compile_models/from_pytorch.html
index d5adffc51..3621bcb68 100644
--- a/docs/how_to/compile_models/from_pytorch.html
+++ b/docs/how_to/compile_models/from_pytorch.html
@@ -409,10 +409,12 @@ be unstable.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/resnet18-f37072fd.pth&quot; to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
 
   0%|          | 0.00/44.7M [00:00&lt;?, ?B/s]
- 10%|9         | 4.34M/44.7M [00:00&lt;00:00, 45.5MB/s]
- 19%|#9        | 8.67M/44.7M [00:00&lt;00:00, 45.3MB/s]
- 69%|######9   | 31.0M/44.7M [00:00&lt;00:00, 131MB/s]
-100%|##########| 44.7M/44.7M [00:00&lt;00:00, 128MB/s]
+  3%|2         | 1.29M/44.7M [00:00&lt;00:03, 13.3MB/s]
+  9%|8         | 3.87M/44.7M [00:00&lt;00:02, 21.3MB/s]
+ 21%|##        | 9.27M/44.7M [00:00&lt;00:00, 37.3MB/s]
+ 47%|####6     | 20.9M/44.7M [00:00&lt;00:00, 70.5MB/s]
+ 96%|#########5| 42.8M/44.7M [00:00&lt;00:00, 128MB/s]
+100%|##########| 44.7M/44.7M [00:00&lt;00:00, 91.8MB/s]
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/compile_models/from_tensorflow.html b/docs/how_to/compile_models/from_tensorflow.html
index 8edabc378..a78d8e60f 100644
--- a/docs/how_to/compile_models/from_tensorflow.html
+++ b/docs/how_to/compile_models/from_tensorflow.html
@@ -631,7 +631,6 @@ banana (score = 0.00022)
 desk (score = 0.00019)
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  3.586 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-tensorflow-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_tensorflow.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/sg_execution_times.html b/docs/how_to/compile_models/sg_execution_times.html
index 091aabec4..8268ce84a 100644
--- a/docs/how_to/compile_models/sg_execution_times.html
+++ b/docs/how_to/compile_models/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-compile-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>04:57.533</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
+<p><strong>05:08.856</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 81%" />
@@ -331,43 +331,43 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></td>
-<td><p>01:03.586</p></td>
+<td><p>00:59.479</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></td>
-<td><p>01:02.506</p></td>
+<td><p>00:59.268</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="from_paddle.html#sphx-glr-how-to-compile-models-from-paddle-py"><span class="std std-ref">Compile PaddlePaddle Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_paddle.py</span></code>)</p></td>
-<td><p>00:39.520</p></td>
+<td><p>00:40.230</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></td>
-<td><p>00:25.939</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></td>
+<td><p>00:32.086</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="from_oneflow.html#sphx-glr-how-to-compile-models-from-oneflow-py"><span class="std std-ref">Compile OneFlow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_oneflow.py</span></code>)</p></td>
-<td><p>00:25.603</p></td>
+<td><p>00:25.277</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></td>
-<td><p>00:22.957</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></td>
+<td><p>00:24.849</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></td>
-<td><p>00:21.666</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></td>
+<td><p>00:23.270</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></td>
-<td><p>00:19.331</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></td>
+<td><p>00:22.769</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></td>
-<td><p>00:14.019</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></td>
+<td><p>00:19.076</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="from_onnx.html#sphx-glr-how-to-compile-models-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></td>
-<td><p>00:02.406</p></td>
+<td><p>00:02.552</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/deploy_models/deploy_model_on_android.html b/docs/how_to/deploy_models/deploy_model_on_android.html
index c930ec556..c7e759251 100644
--- a/docs/how_to/deploy_models/deploy_model_on_android.html
+++ b/docs/how_to/deploy_models/deploy_model_on_android.html
@@ -648,7 +648,7 @@ to the remote android device.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  15.8706      15.8903      15.9835      15.7127       0.0826
+  15.7614      15.6817      16.6102      15.5854       0.2887
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
index 15066679e..d65dacbd8 100644
--- a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
+++ b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
@@ -431,41 +431,14 @@ be unstable.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth&quot; to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
 
   0%|          | 0.00/170M [00:00&lt;?, ?B/s]
-  3%|3         | 5.31M/170M [00:00&lt;00:03, 55.2MB/s]
-100%|##########| 170M/170M [00:03&lt;00:00, 45.6MB/s]
+ 11%|#         | 18.4M/170M [00:00&lt;00:00, 193MB/s]
+100%|##########| 170M/170M [00:00&lt;00:00, 234MB/s]
 /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
   for i in range(dim)
 /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the &#39;trunc&#39; function NOT &#39;floor&#39;). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode=&#39;trunc&#39;), or for actual floor division, use torch.div(a, b, rounding_mode=&#39;floor&#39;).
@@ -560,7 +533,7 @@ torchvision rcnn models.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Get 9 valid boxes
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  58.197 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  57.913 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-object-detection-pytorch-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_object_detection_pytorch.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_prequantized.html b/docs/how_to/deploy_models/deploy_prequantized.html
index 72d4fdbbc..3ff2e5536 100644
--- a/docs/how_to/deploy_models/deploy_prequantized.html
+++ b/docs/how_to/deploy_models/deploy_prequantized.html
@@ -475,12 +475,10 @@ training. Other models require a full post training calibration.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/mobilenet_v2-b0353104.pth&quot; to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
 
   0%|          | 0.00/13.6M [00:00&lt;?, ?B/s]
-  4%|4         | 576k/13.6M [00:00&lt;00:02, 5.87MB/s]
-100%|##########| 13.6M/13.6M [00:00&lt;00:00, 27.0MB/s]
+  8%|8         | 1.09M/13.6M [00:00&lt;00:01, 11.4MB/s]
+100%|##########| 13.6M/13.6M [00:00&lt;00:00, 38.8MB/s]
 </pre></div>
 </div>
 </div>
@@ -569,7 +567,7 @@ output values are identical out of 1000 outputs from mobilenet v2.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  90.5050      90.4569      93.1727      90.1209       0.3688
+  90.3646      90.2865      92.6330      90.2238       0.2759
 </pre></div>
 </div>
 <div class="admonition note">
@@ -608,7 +606,7 @@ This includes support for the VNNI 8 bit dot product instruction (CascadeLake or
 <div class="section" id="deploy-a-quantized-tflite-model">
 <h2>Deploy a quantized TFLite Model<a class="headerlink" href="#deploy-a-quantized-tflite-model" title="Permalink to this headline">¶</a></h2>
 <p>TODO</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  6.790 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  10.470 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_prequantized_tflite.html b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
index 83a5f013c..97f9ef778 100644
--- a/docs/how_to/deploy_models/deploy_prequantized_tflite.html
+++ b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
@@ -568,7 +568,7 @@ TFLite Top-5 labels: [387 102 386 341 349]
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  119.9239     119.8118     124.9368     119.2458      0.7291
+  119.3345     119.1992     127.5463     118.7020      0.8721
 </pre></div>
 </div>
 <div class="admonition note">
@@ -596,7 +596,7 @@ network for ARM CPU</span></a>.</p></li>
 </ul>
 </div></blockquote>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  57.294 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  52.020 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-tflite-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized_tflite.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_quantized.html b/docs/how_to/deploy_models/deploy_quantized.html
index 2cc6e34a4..bc70867fa 100644
--- a/docs/how_to/deploy_models/deploy_quantized.html
+++ b/docs/how_to/deploy_models/deploy_quantized.html
@@ -504,7 +504,7 @@ for calibration. But the accuracy might be impacted.</p>
   DeprecationWarning,
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  31.072 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  25.102 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-quantized-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_quantized.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
index ba40dae3a..341d0f29c 100644
--- a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
+++ b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
@@ -436,24 +436,24 @@ to your device.</p>
 Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
 
   0%|          | 0/132723 [00:00&lt;?, ?KB/s]
-  4%|3         | 5091/132723 [00:00&lt;00:02, 50903.37KB/s]
-100%|##########| 132723/132723 [00:01&lt;00:00, 76502.38KB/s]
+  4%|4         | 5381/132723 [00:00&lt;00:02, 53800.82KB/s]
+100%|##########| 132723/132723 [00:01&lt;00:00, 76834.91KB/s]
 </pre></div>
 </div>
 <p>Create TVM runtime and do inference
@@ -496,7 +496,7 @@ Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from h
 <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
 </pre></div>
 </div>
-<img src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" srcset="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" alt="deploy ssd gluoncv" class = "sphx-glr-single-img"/><p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  20.224 seconds)</p>
+<img src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" srcset="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" alt="deploy ssd gluoncv" class = "sphx-glr-single-img"/><p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  18.511 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-ssd-gluoncv-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_ssd_gluoncv.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/sg_execution_times.html b/docs/how_to/deploy_models/sg_execution_times.html
index 9d9fab467..9afa9a30f 100644
--- a/docs/how_to/deploy_models/sg_execution_times.html
+++ b/docs/how_to/deploy_models/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-deploy-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>10:44.323</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
+<p><strong>10:35.640</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 86%" />
@@ -331,31 +331,31 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-how-to-deploy-models-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></td>
-<td><p>02:58.197</p></td>
+<td><p>02:57.913</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-how-to-deploy-models-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></td>
-<td><p>02:20.224</p></td>
+<td><p>02:18.511</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-how-to-deploy-models-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></td>
-<td><p>01:57.294</p></td>
+<td><p>01:52.020</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_quantized.html#sphx-glr-how-to-deploy-models-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></td>
-<td><p>01:31.072</p></td>
+<td><p>01:25.102</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_prequantized.html#sphx-glr-how-to-deploy-models-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></td>
-<td><p>01:06.790</p></td>
+<td><p>01:10.470</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_model_on_android.html#sphx-glr-how-to-deploy-models-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></td>
-<td><p>00:28.246</p></td>
+<td><p>00:30.126</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-how-to-deploy-models-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></td>
-<td><p>00:22.493</p></td>
+<td><p>00:21.492</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_sparse.html#sphx-glr-how-to-deploy-models-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></td>
diff --git a/docs/how_to/extend_tvm/bring_your_own_datatypes.html b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
index 8fa067289..9606c1dd4 100644
--- a/docs/how_to/extend_tvm/bring_your_own_datatypes.html
+++ b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
@@ -607,7 +607,7 @@ In this alpha state of the Bring Your Own Datatypes framework, we have not imple
 <span class="n">module</span><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">params</span></a> <span class="o">=</span> <span class="n">get_mobilenet</span><span class="p">()</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip009d8886-d77e-484a-8d9e-3d2749623c9e from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip99d12be9-192c-4776-b24a-3784472088b8 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 </pre></div>
 </div>
 <p>It’s easy to execute MobileNet with native TVM:</p>
diff --git a/docs/how_to/extend_tvm/sg_execution_times.html b/docs/how_to/extend_tvm/sg_execution_times.html
index e78fa164d..37c64a974 100644
--- a/docs/how_to/extend_tvm/sg_execution_times.html
+++ b/docs/how_to/extend_tvm/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-extend-tvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:39.888</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
+<p><strong>00:39.452</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -331,19 +331,19 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="bring_your_own_datatypes.html#sphx-glr-how-to-extend-tvm-bring-your-own-datatypes-py"><span class="std std-ref">Bring Your Own Datatypes to TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">bring_your_own_datatypes.py</span></code>)</p></td>
-<td><p>00:36.328</p></td>
+<td><p>00:36.319</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="use_pass_instrument.html#sphx-glr-how-to-extend-tvm-use-pass-instrument-py"><span class="std std-ref">How to Use TVM Pass Instrument</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_instrument.py</span></code>)</p></td>
-<td><p>00:02.637</p></td>
+<td><p>00:02.203</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="use_pass_infra.html#sphx-glr-how-to-extend-tvm-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></td>
-<td><p>00:00.916</p></td>
+<td><p>00:00.921</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="low_level_custom_pass.html#sphx-glr-how-to-extend-tvm-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></td>
-<td><p>00:00.007</p></td>
+<td><p>00:00.008</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/extend_tvm/use_pass_instrument.html b/docs/how_to/extend_tvm/use_pass_instrument.html
index 482e89832..c91b8b6ec 100644
--- a/docs/how_to/extend_tvm/use_pass_instrument.html
+++ b/docs/how_to/extend_tvm/use_pass_instrument.html
@@ -507,10 +507,10 @@ profile the execution time of each pass.</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 9915us [9915us] (44.98%; 44.98%)
-FoldScaleAxis: 12127us [6us] (55.02%; 55.02%)
-        FoldConstant: 12120us [2351us] (54.99%; 99.95%)
-                InferType: 9770us [9770us] (44.32%; 80.61%)
+InferType: 6763us [6763us] (45.62%; 45.62%)
+FoldScaleAxis: 8063us [6us] (54.38%; 54.38%)
+        FoldConstant: 8057us [1625us] (54.34%; 99.92%)
+                InferType: 6432us [6432us] (43.38%; 79.83%)
 </pre></div>
 </div>
 </div>
@@ -532,10 +532,10 @@ Refer to following sections and <a class="reference internal" href="../../refere
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 9507us [9507us] (44.51%; 44.51%)
-FoldScaleAxis: 11853us [5us] (55.49%; 55.49%)
-        FoldConstant: 11848us [2500us] (55.47%; 99.95%)
-                InferType: 9348us [9348us] (43.76%; 78.90%)
+InferType: 6405us [6405us] (44.86%; 44.86%)
+FoldScaleAxis: 7871us [5us] (55.14%; 55.14%)
+        FoldConstant: 7867us [1612us] (55.10%; 99.94%)
+                InferType: 6254us [6254us] (43.81%; 79.50%)
 </pre></div>
 </div>
 <p>Register empty list to clear existing instruments.</p>
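The profiles above come from pass instrumentation; a self-contained sketch of producing such a report with a toy Relay module (timings will differ from the numbers shown):

    import tvm
    from tvm import relay
    from tvm.ir.instrument import PassTimingInstrument

    # Toy module so the sketch is runnable on its own.
    x = relay.var("x", shape=(1, 16), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

    timing = PassTimingInstrument()
    with tvm.transform.PassContext(opt_level=3, instruments=[timing]):
        relay.build(mod, target="llvm")
        # render() must be called while the PassContext is still active.
        print(timing.render())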
diff --git a/docs/how_to/optimize_operators/opt_conv_cuda.html b/docs/how_to/optimize_operators/opt_conv_cuda.html
index ef99990de..f66d5188f 100644
--- a/docs/how_to/optimize_operators/opt_conv_cuda.html
+++ b/docs/how_to/optimize_operators/opt_conv_cuda.html
@@ -559,7 +559,7 @@ latency of convolution.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Convolution: </span><span class="si">%f</span><span class="s2"> ms&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span> <span class="o">*</span> <span cl [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 39.655540 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 40.681556 ms
 </pre></div>
 </div>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-optimize-operators-opt-conv-cuda-py">
diff --git a/docs/how_to/optimize_operators/opt_conv_tensorcore.html b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
index f47ff069d..864d53b3d 100644
--- a/docs/how_to/optimize_operators/opt_conv_tensorcore.html
+++ b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
@@ -901,7 +901,7 @@ be able to run on our build server</p>
     <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;conv2d with tensor core: </span><span class="si">%f</span><span class="s2"> ms&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span> <span class="o">* [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 8.041140 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 13.369779 ms
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/optimize_operators/opt_gemm.html b/docs/how_to/optimize_operators/opt_gemm.html
index 8eddef16f..dff16c4a4 100644
--- a/docs/how_to/optimize_operators/opt_gemm.html
+++ b/docs/how_to/optimize_operators/opt_gemm.html
@@ -456,8 +456,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Baseline: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018982
-Baseline: 3.371941
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018258
+Baseline: 3.408796
 </pre></div>
 </div>
 <p>In TVM, we can always inspect lower level IR to debug or optimize our schedule.
@@ -517,7 +517,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt1: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.303872
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.284209
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
@@ -584,7 +584,7 @@ vastly.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt2: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.344798
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.330278
 </pre></div>
 </div>
 <p>Here is the generated IR after vectorization.</p>
@@ -645,7 +645,7 @@ the access pattern for A matrix is more cache friendly.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt3: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.119397
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.116985
 </pre></div>
 </div>
 <p>Here is the generated IR after loop permutation.</p>
@@ -728,7 +728,7 @@ flattening.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt4: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.110414
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.110756
 </pre></div>
 </div>
 <p>Here is the generated IR after array packing.</p>
@@ -814,7 +814,7 @@ write to C when all the block results are ready.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt5: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.111262
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.111863
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
@@ -904,7 +904,7 @@ write to C when all the block results are ready.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt6: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">opt6_time</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.144883
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.145240
 </pre></div>
 </div>
 <p>Here is the generated IR after parallelization.</p>
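The Opt1-Opt6 timings above correspond to successive TE schedule transformations; a condensed, hedged sketch of the first three steps (blocking, vectorization, loop permutation) with the tutorial's 1024x1024 shapes:

    import tvm
    from tvm import te

    M = K = N = 1024
    bn = 32  # a 32x32 float tile is 4KB, well within a 32KB L1 cache

    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

    s = te.create_schedule(C.op)
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)  # Opt1: blocking
    ko, ki = s[C].split(k, factor=4)
    s[C].reorder(mo, no, ko, mi, ki, ni)  # Opt3: permute loops for cache-friendly A access
    s[C].vectorize(ni)                    # Opt2: vectorize the innermost axis
    func = tvm.build(s, [A, B, C], target="llvm")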
diff --git a/docs/how_to/optimize_operators/sg_execution_times.html b/docs/how_to/optimize_operators/sg_execution_times.html
index 607845de7..0de64de4e 100644
--- a/docs/how_to/optimize_operators/sg_execution_times.html
+++ b/docs/how_to/optimize_operators/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-optimize-operators-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:34.575</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
+<p><strong>00:34.270</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -331,15 +331,15 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="opt_gemm.html#sphx-glr-how-to-optimize-operators-opt-gemm-py"><span class="std std-ref">How to optimize GEMM on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_gemm.py</span></code>)</p></td>
-<td><p>00:32.262</p></td>
+<td><p>00:31.892</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="opt_conv_tensorcore.html#sphx-glr-how-to-optimize-operators-opt-conv-tensorcore-py"><span class="std std-ref">How to optimize convolution using TensorCores</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_tensorcore.py</span></code>)</p></td>
-<td><p>00:01.291</p></td>
+<td><p>00:01.362</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="opt_conv_cuda.html#sphx-glr-how-to-optimize-operators-opt-conv-cuda-py"><span class="std std-ref">How to optimize convolution on GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_cuda.py</span></code>)</p></td>
-<td><p>00:01.022</p></td>
+<td><p>00:01.015</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
index 3b96d23c0..30ee80297 100644
--- a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
+++ b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-tune-with-autoscheduler-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>05:24.599</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
+<p><strong>05:37.999</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 85%" />
@@ -331,27 +331,27 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a Convolution Layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></td>
-<td><p>02:46.458</p></td>
+<td><p>02:57.713</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_network_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-x86-py"><span class="std std-ref">Auto-scheduling a Neural Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_x86.py</span></code>)</p></td>
-<td><p>01:20.837</p></td>
+<td><p>01:20.722</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_network_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-cuda-py"><span class="std std-ref">Auto-scheduling a Neural Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_cuda.py</span></code>)</p></td>
-<td><p>00:43.350</p></td>
+<td><p>00:43.441</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_sparse_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-sparse-x86-py"><span class="std std-ref">Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_sparse_x86.py</span></code>)</p></td>
-<td><p>00:17.110</p></td>
+<td><p>00:18.298</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_network_mali.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-mali-py"><span class="std std-ref">Auto-scheduling a Neural Network for mali GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_mali.py</span></code>)</p></td>
-<td><p>00:08.596</p></td>
+<td><p>00:09.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_network_arm.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-arm-py"><span class="std std-ref">Auto-scheduling a Neural Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_arm.py</span></code>)</p></td>
-<td><p>00:08.249</p></td>
+<td><p>00:08.824</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
index 57e8d8fa1..cd0584ba8 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
@@ -470,9 +470,6 @@ file and apply it.</p>
 <span class="k">del</span> <span class="n">measure_ctx</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>.T
-</pre></div>
-</div>
 <p>We can lower the schedule to see the IR after auto-scheduling.
 The auto-scheduler correctly performs optimizations including multi-level tiling,
 cooperative fetching, unrolling and operator fusion.</p>
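The lowering step that produces the TIR below looks roughly like this; a hedged sketch whose workload registration and log file mirror this tutorial's conv2d task (bias and relu omitted for brevity):

    import tvm
    from tvm import auto_scheduler, te, topi

    @auto_scheduler.register_workload
    def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
        data = te.placeholder((N, CI, H, W), name="data")
        kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
        conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1)
        return [data, kernel, conv]

    task = auto_scheduler.SearchTask(
        func=conv2d_layer, args=(1, 7, 7, 512, 512, 3, 3, 1, 1), target="cuda"
    )
    # Apply the best record from an existing log file and inspect the IR.
    sch, args = task.apply_best("conv2d.json")
    print(tvm.lower(sch, args, simple_mode=True))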
@@ -489,53 +486,270 @@ cooperative fetching, unrolling and operator fusion.</p>
              compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
   buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
   preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
-  attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 112;
-  allocate(conv2d_nchw: Pointer(local float32), float32, [7]), storage_scope = local;
-  allocate(pad_temp.shared: Pointer(shared float32), float32, [54]), storage_scope = shared;
-  allocate(kernel.shared: Pointer(shared float32), float32, [576]), storage_scope = shared;
-  attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 32 {
-    conv2d_nchw_1: Buffer(conv2d_nchw, float32, [1], [], scope=&quot;local&quot;, align=4)[0] = 0f32
+  attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 32;
+  allocate(conv2d_nchw: Pointer(local float32), float32, [28]), storage_scope = local;
+  allocate(pad_temp.shared: Pointer(shared float32), float32, [1008]), storage_scope = shared;
+  allocate(kernel.shared: Pointer(shared float32), float32, [768]), storage_scope = shared;
+  attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28 {
+    conv2d_nchw_1: Buffer(conv2d_nchw, float32, [28], [], scope=&quot;local&quot;, align=64)[0] = 0f32
+    conv2d_nchw_1[7] = 0f32
     conv2d_nchw_1[1] = 0f32
+    conv2d_nchw_1[8] = 0f32
     conv2d_nchw_1[2] = 0f32
+    conv2d_nchw_1[9] = 0f32
     conv2d_nchw_1[3] = 0f32
+    conv2d_nchw_1[10] = 0f32
     conv2d_nchw_1[4] = 0f32
+    conv2d_nchw_1[11] = 0f32
     conv2d_nchw_1[5] = 0f32
+    conv2d_nchw_1[12] = 0f32
     conv2d_nchw_1[6] = 0f32
-    for (rc.outer.outer: int32, 0, 256) {
-      for (ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer: int32, 0, 2) {
-        attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 32;
-        if @tir.likely((((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*16) + floordiv(threadIdx.x_1, 2)) &lt; 27), dtype=bool) {
-          pad_temp.shared_1: Buffer(pad_temp.shared, float32, [54], [], scope=&quot;shared&quot;)[((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*32) + threadIdx.x_1)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*5) + threadIdx.x_1), 27), 9) + floormod(blockIdx.x, 7))) &amp;&amp; ((floordiv(floormod(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer*5) + threadIdx.x_1), 27), 9) + floormod(blockIdx.x, 7)) &lt; 8)) &amp;&amp; (1 &lt;= floo [...]
-        }
-      }
-      for (ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1: int32, 0, 9) {
-        attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 32;
-        kernel.shared_1: Buffer(kernel.shared, float32, [576], [], scope=&quot;shared&quot;)[ramp(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1*64) + (threadIdx.x_2*2)), 1, 2)] = kernel[ramp(((((floordiv(blockIdx.x, 7)*147456) + (floordiv(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1*64) + (threadIdx.x_2*2)), 18)*4608)) + (rc.outer.outer*18)) + floormod(((ax0.ax1.fused.ax2.fused.ax3.fused.outer.outer_1*10) + (threadIdx.x_2*2)), 18)), 1, 2)]
-      }
-      for (rc.inner: int32, 0, 2) {
-        for (ry.inner: int32, 0, 3) {
-          for (rx.inner: int32, 0, 3) {
-            let cse_var_1: int32 = (((rc.inner*27) + (ry.inner*9)) + rx.inner)
-             {
-              conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[cse_var_1]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-              conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(cse_var_1 + 1)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-              conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(cse_var_1 + 2)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-              conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(cse_var_1 + 3)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-              conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(cse_var_1 + 4)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-              conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(cse_var_1 + 5)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-              conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(cse_var_1 + 6)]*kernel.shared_1[((((threadIdx.x*18) + (rc.inner*9)) + (ry.inner*3)) + rx.inner)]))
-            }
+    conv2d_nchw_1[13] = 0f32
+    conv2d_nchw_1[14] = 0f32
+    conv2d_nchw_1[21] = 0f32
+    conv2d_nchw_1[15] = 0f32
+    conv2d_nchw_1[22] = 0f32
+    conv2d_nchw_1[16] = 0f32
+    conv2d_nchw_1[23] = 0f32
+    conv2d_nchw_1[17] = 0f32
+    conv2d_nchw_1[24] = 0f32
+    conv2d_nchw_1[18] = 0f32
+    conv2d_nchw_1[25] = 0f32
+    conv2d_nchw_1[19] = 0f32
+    conv2d_nchw_1[26] = 0f32
+    conv2d_nchw_1[20] = 0f32
+    conv2d_nchw_1[27] = 0f32
+    for (rc.outer.outer: int32, 0, 32) {
+      for (ry.outer.outer: int32, 0, 3) {
+        let cse_var_4: int32 = (rc.outer.outer*784)
+        let cse_var_3: int32 = (ry.outer.outer*7)
+        let cse_var_2: int32 = (rc.outer.outer*144)
+        let cse_var_1: int32 = (ry.outer.outer*3)
+         {
+          attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1008], [], scope=&quot;shared&quot;)[threadIdx.x_1] = @tir.if_then_else((((1 &lt;= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 28)] = @tir.if_then_else(((((floordiv((threadIdx.x_1 + 28), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 1), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 1), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 28), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 56)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 2), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 2), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 56), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 84)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 3), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 3), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 84), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 112)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 4), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 4), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 112), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 140)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 5), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 5), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 140), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 168)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 6), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 6), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 168), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 7), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 7), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 196), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 224)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 8), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 8), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 224), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 252)] = @tir.if_then_else((((1 &lt;= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 188)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 280)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 1), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 1), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 280), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 308)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 2), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 2), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 308), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 336)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 3), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 3), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 336), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 364)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 4), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 4), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 364), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 5), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 5), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 392), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 420)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 6), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 6), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 420), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 448)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 7), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 7), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 448), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 476)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 8), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 8), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 476), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 504)] = @tir.if_then_else((((1 &lt;= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 384)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 532)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 1), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 1), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 532), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 560)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 2), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 2), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 560), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 3), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 3), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 588), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 616)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 4), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 4), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 616), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 644)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 5), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 5), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 644), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 672)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 6), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 6), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 672), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 700)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 7), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 7), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 700), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 728)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 8), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 8), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 728), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 756)] = @tir.if_then_else((((1 &lt;= (floordiv(threadIdx.x_1, 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[((((cse_var_4 + (floordiv(threadIdx.x_1, 9)*7)) + cse_var_3) + floormod(threadIdx.x_1, 9)) + 580)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 28), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 1), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 1), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 784), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 812)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 56), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 2), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 2), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 812), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 840)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 3), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 3), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 840), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 868)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 49), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 4), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 4), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 868), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 896)] = @tir.if_then_else(((1 &lt;= floormod((threadIdx.x_1 + 5), 9)) &amp;&amp; (floormod((threadIdx.x_1 + 5), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 896), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 924)] = @tir.if_then_else(((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1 + 42), 63), 9) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 6), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 6), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 924), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 952)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1 + 7), 63), 9) + ry.outer.outer)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 7), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 7), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 952), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else(((((floordiv(floormod((threadIdx.x_1 + 35), 63), 9) + ry.outer.outer) &lt; 8) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1 + 8), 9))) &amp;&amp; (floormod((threadIdx.x_1 + 8), 9) &lt; 8)), data[((((cse_var_4 + (floordiv((threadIdx.x_1 + 980), 9)*7)) + cse_var_3) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
+          attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1: Buffer(kernel.shared, float32, [768], [], scope=&quot;shared&quot;)[threadIdx.x_2] = kernel[(((((blockIdx.x*73728) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 28)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 28), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 56)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 56), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 84)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 84), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 12), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 112)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 112), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 140)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 140), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 44), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 168)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 168), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 196)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 196), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 4), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 224)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 224), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 32), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 252)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 252), 48)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 4)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 280)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 280), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 40), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 308)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 308), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 20), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 336)] = kernel[((((((blockIdx.x*73728) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 32256)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 364)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 364), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 392), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 420)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 420), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 12), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 448)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 448), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 16), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 476)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 476), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 44), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 504)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 504), 48)*4608)) + cse_var_2) + (floormod((floordiv(threadIdx.x_2, 3) + 8), 16)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 532)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 532), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 4), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 560)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 560), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 32), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 588)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 588), 48)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 4)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 616)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 616), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 40), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 644)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 644), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 20), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 672)] = kernel[((((((blockIdx.x*73728) + cse_var_2) + (floordiv(threadIdx.x_2, 3)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3)) + 64512)]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 700)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 700), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 28), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 1), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          kernel.shared_1[(threadIdx.x_2 + 728)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 728), 48)*4608)) + cse_var_2) + (floordiv(floormod((threadIdx.x_2 + 8), 48), 3)*9)) + cse_var_1) + floormod((threadIdx.x_2 + 2), 3))]
+          attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 28;
+          if @tir.likely((threadIdx.x_2 &lt; 12), dtype=bool) {
+            kernel.shared_1[(threadIdx.x_2 + 756)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 756), 48)*4608)) + cse_var_2) + ((floordiv(threadIdx.x_2, 3) + 12)*9)) + cse_var_1) + floormod(threadIdx.x_2, 3))]
+          }
+          for (rc.outer.inner: int32, 0, 16) {
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3))]))
+            conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 48)]))
+            conv2d_nchw_1[14] = (conv2d_nchw_1[14] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[21] = (conv2d_nchw_1[21] + (pad_temp.shared_1[((rc.outer.inner*63) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[15] = (conv2d_nchw_1[15] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[22] = (conv2d_nchw_1[22] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[16] = (conv2d_nchw_1[16] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[23] = (conv2d_nchw_1[23] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[17] = (conv2d_nchw_1[17] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[24] = (conv2d_nchw_1[24] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 27)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[18] = (conv2d_nchw_1[18] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[25] = (conv2d_nchw_1[25] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 36)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[19] = (conv2d_nchw_1[19] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[26] = (conv2d_nchw_1[26] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 45)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[20] = (conv2d_nchw_1[20] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 96)]))
+            conv2d_nchw_1[27] = (conv2d_nchw_1[27] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 54)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 144)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 1)]))
+            conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 49)]))
+            conv2d_nchw_1[14] = (conv2d_nchw_1[14] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[21] = (conv2d_nchw_1[21] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[15] = (conv2d_nchw_1[15] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[22] = (conv2d_nchw_1[22] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[16] = (conv2d_nchw_1[16] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[23] = (conv2d_nchw_1[23] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[17] = (conv2d_nchw_1[17] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[24] = (conv2d_nchw_1[24] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 28)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[18] = (conv2d_nchw_1[18] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[25] = (conv2d_nchw_1[25] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 37)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[19] = (conv2d_nchw_1[19] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[26] = (conv2d_nchw_1[26] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 46)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[20] = (conv2d_nchw_1[20] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 97)]))
+            conv2d_nchw_1[27] = (conv2d_nchw_1[27] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 55)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 145)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[8] = (conv2d_nchw_1[8] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[9] = (conv2d_nchw_1[9] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[10] = (conv2d_nchw_1[10] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[11] = (conv2d_nchw_1[11] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[12] = (conv2d_nchw_1[12] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 2)]))
+            conv2d_nchw_1[13] = (conv2d_nchw_1[13] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 50)]))
+            conv2d_nchw_1[14] = (conv2d_nchw_1[14] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[21] = (conv2d_nchw_1[21] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+            conv2d_nchw_1[15] = (conv2d_nchw_1[15] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[22] = (conv2d_nchw_1[22] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+            conv2d_nchw_1[16] = (conv2d_nchw_1[16] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[23] = (conv2d_nchw_1[23] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+            conv2d_nchw_1[17] = (conv2d_nchw_1[17] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[24] = (conv2d_nchw_1[24] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 29)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+            conv2d_nchw_1[18] = (conv2d_nchw_1[18] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[25] = (conv2d_nchw_1[25] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 38)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+            conv2d_nchw_1[19] = (conv2d_nchw_1[19] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[26] = (conv2d_nchw_1[26] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 47)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
+            conv2d_nchw_1[20] = (conv2d_nchw_1[20] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 98)]))
+            conv2d_nchw_1[27] = (conv2d_nchw_1[27] + (pad_temp.shared_1[(((rc.outer.inner*63) + floormod(threadIdx.x, 7)) + 56)]*kernel.shared_1[(((floordiv(threadIdx.x, 7)*192) + (rc.outer.inner*3)) + 146)]))
           }
         }
       }
     }
-    compute[(((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7))] = max((conv2d_nchw_1[0] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-    compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 1)] = max((conv2d_nchw_1[1] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-    compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 2)] = max((conv2d_nchw_1[2] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-    compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 3)] = max((conv2d_nchw_1[3] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-    compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 4)] = max((conv2d_nchw_1[4] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-    compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 5)] = max((conv2d_nchw_1[5] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
-    compute[((((floordiv(blockIdx.x, 7)*1568) + (threadIdx.x*49)) + (floormod(blockIdx.x, 7)*7)) + 6)] = max((conv2d_nchw_1[6] + bias[((floordiv(blockIdx.x, 7)*32) + threadIdx.x)]), 0f32)
+    for (i1.inner: int32, 0, 4) {
+      for (i2.inner: int32, 0, 7) {
+        compute[(((((blockIdx.x*784) + (floordiv(threadIdx.x, 7)*196)) + (i1.inner*49)) + (i2.inner*7)) + floormod(threadIdx.x, 7))] = max((conv2d_nchw_1[((i1.inner*7) + i2.inner)] + bias[(((blockIdx.x*16) + (floordiv(threadIdx.x, 7)*4)) + i1.inner)]), 0f32)
+      }
+    }
   }
 }
 </pre></div>
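The final stores above fuse the bias add and ReLU into the convolution's output stage: each element is written as max(conv + bias, 0). Reading the indices off the code (32 blocks x 16 channels, 784 = 16*49 elements per block), the output is a (1, 512, 7, 7) NCHW tensor. A minimal TE sketch of just that epilogue, with shapes inferred from the listing rather than taken from the tutorial source:

import tvm
from tvm import te

conv = te.placeholder((1, 512, 7, 7), name="conv")   # conv2d_nchw accumulator
bias = te.placeholder((1, 512, 1, 1), name="bias")
out = te.compute(
    conv.shape,
    # elementwise max against 0 is the ReLU seen in the stores above
    lambda n, f, y, x: te.max(conv[n, f, y, x] + bias[n, f, 0, 0],
                              tvm.tir.const(0.0, "float32")),
    name="compute",
)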
@@ -571,7 +785,7 @@ cooperative fetching, unrolling and operator fusion.</p>
 <span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.311 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.355 ms
 </pre></div>
 </div>
 </div>
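The execution time reported above comes from TVM's time_evaluator, which runs the compiled kernel repeatedly and returns per-run timings. A self-contained sketch of the same measurement pattern, here on a stand-in elementwise kernel rather than the tuned conv2d:

import numpy as np
import tvm
from tvm import te

# A trivial CUDA kernel, only to have something to time.
n = 1 << 20
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=256)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
func = tvm.build(s, [A, B], target="cuda")

dev = tvm.cuda()
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.empty((n,), "float32", dev)
# min_repeat_ms keeps each measurement window long enough to be stable.
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print("Execution time: %.3f ms" % (np.median(evaluator(a, b).results) * 1000))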
@@ -600,37 +814,37 @@ conv2d_nchw_nn_o_i, conv2d_nchw_nn_i = s[conv2d_nchw].split(conv2d_nchw_nn, fact
 conv2d_nchw_nn_o_o_i, conv2d_nchw_nn_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_i, factor=1)
 conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
 conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
-conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
-conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
-conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=32)
+conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=2)
+conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
+conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=4)
 conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
 conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
-conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
+conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=7)
 conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=1)
 conv2d_nchw_yy_o_o_o_o, conv2d_nchw_yy_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_o_i, factor=1)
 conv2d_nchw_xx_o_i, conv2d_nchw_xx_i = s[conv2d_nchw].split(conv2d_nchw_xx, factor=1)
 conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
-conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=1)
-conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=7)
-conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
-conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=1)
-conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
+conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
+conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
+conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=1)
+conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=16)
+conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
 conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
-conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3)
-conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
+conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
+conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
 s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nc [...]
 compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
 compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
 compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
-compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=32)
+compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=4)
+compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=4)
 compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
-compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
+compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=7)
 compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=1)
 compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
 compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
-compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
-compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=7)
+compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=7)
+compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
 s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
 s[conv2d_nchw].compute_at(s[compute], compute_i3_o_i)
 kernel_shared = s.cache_read(kernel, &quot;shared&quot;, [conv2d_nchw])
@@ -647,16 +861,16 @@ s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, t
 compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
 s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis(&quot;threadIdx.x&quot;))
 kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
-kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=2)
+kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
 s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=28)
 s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(&quot;threadIdx.x&quot;))
 pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
 pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
 s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
-pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=28)
 s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(&quot;threadIdx.x&quot;))
-s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 0)
+s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 1024)
 s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;unroll_explicit&quot;, True)
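Each split(axis, factor=k) in the listing above produces an (outer, inner) axis pair with extents ceil(extent/k) and k, and the later fuse/bind calls map those axes onto CUDA block and thread indices, which is what emits the "thread_extent" attributes in the lowered TIR. A minimal standalone sketch of that pattern (factor 28 chosen only to mirror the thread extent above):

import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
# split yields (outer, inner); binding them produces one CUDA thread block
# of 28 threads per outer iteration.
bx, tx = s[B].split(B.op.axis[0], factor=28)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
print(tvm.lower(s, [A, B], simple_mode=True))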
 
 CUDA source code:
@@ -674,49 +888,201 @@ CUDA source code:
   #define int64_t long long
   #define uint64_t unsigned long long
 #endif
-extern &quot;C&quot; __global__ void __launch_bounds__(32) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-  float conv2d_nchw[7];
-  __shared__ float pad_temp_shared[54];
-  __shared__ float kernel_shared[576];
+extern &quot;C&quot; __global__ void __launch_bounds__(28) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
+  float conv2d_nchw[28];
+  __shared__ float pad_temp_shared[1008];
+  __shared__ float kernel_shared[768];
   conv2d_nchw[0] = 0.000000e+00f;
+  conv2d_nchw[7] = 0.000000e+00f;
   conv2d_nchw[1] = 0.000000e+00f;
+  conv2d_nchw[8] = 0.000000e+00f;
   conv2d_nchw[2] = 0.000000e+00f;
+  conv2d_nchw[9] = 0.000000e+00f;
   conv2d_nchw[3] = 0.000000e+00f;
+  conv2d_nchw[10] = 0.000000e+00f;
   conv2d_nchw[4] = 0.000000e+00f;
+  conv2d_nchw[11] = 0.000000e+00f;
   conv2d_nchw[5] = 0.000000e+00f;
+  conv2d_nchw[12] = 0.000000e+00f;
   conv2d_nchw[6] = 0.000000e+00f;
-  for (int rc_outer_outer = 0; rc_outer_outer &lt; 256; ++rc_outer_outer) {
-    __syncthreads();
-    for (int ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer = 0; ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer &lt; 2; ++ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer) {
-      if (((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 16) + (((int)threadIdx.x) &gt;&gt; 1)) &lt; 27) {
-        pad_temp_shared[((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 32) + ((int)threadIdx.x))] = (((((1 &lt;= (((((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 5) + ((int)threadIdx.x)) % 27) / 9) + (((int)blockIdx.x) % 7))) &amp;&amp; ((((((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 5) + ((int)threadIdx.x)) % 27) / 9) + (((int)blockIdx.x) % 7)) &lt; 8)) &amp;&amp; (1 &lt;= (((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer * 5) + ((int)threadIdx.x)) % 9))) &amp;&amp; ((((ax0_ [...]
+  conv2d_nchw[13] = 0.000000e+00f;
+  conv2d_nchw[14] = 0.000000e+00f;
+  conv2d_nchw[21] = 0.000000e+00f;
+  conv2d_nchw[15] = 0.000000e+00f;
+  conv2d_nchw[22] = 0.000000e+00f;
+  conv2d_nchw[16] = 0.000000e+00f;
+  conv2d_nchw[23] = 0.000000e+00f;
+  conv2d_nchw[17] = 0.000000e+00f;
+  conv2d_nchw[24] = 0.000000e+00f;
+  conv2d_nchw[18] = 0.000000e+00f;
+  conv2d_nchw[25] = 0.000000e+00f;
+  conv2d_nchw[19] = 0.000000e+00f;
+  conv2d_nchw[26] = 0.000000e+00f;
+  conv2d_nchw[20] = 0.000000e+00f;
+  conv2d_nchw[27] = 0.000000e+00f;
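+  // 28 accumulators per thread: 4 output channels (i1.inner) x 7 output rows
+  // (i2.inner), kept in registers until the write-back loop at the end.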
+  for (int rc_outer_outer = 0; rc_outer_outer &lt; 32; ++rc_outer_outer) {
+    for (int ry_outer_outer = 0; ry_outer_outer &lt; 3; ++ry_outer_outer) {
+      __syncthreads();
+      pad_temp_shared[((int)threadIdx.x)] = ((((1 &lt;= ((((int)threadIdx.x) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= (((int)threadIdx.x) % 9))) &amp;&amp; ((((int)threadIdx.x) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 28)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 1) % 9))) &amp;&amp; (((((int)threadIdx.x) + 1) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 28) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 56)] = (((((1 &lt;= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 2) % 9))) &amp;&amp; (((((int)threadIdx.x) + 2) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 56) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 84)] = (((1 &lt;= ((((int)threadIdx.x) + 3) % 9)) &amp;&amp; (((((int)threadIdx.x) + 3) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 84) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 112)] = (((((1 &lt;= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 4) % 9))) &amp;&amp; (((((int)threadIdx.x) + 4) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 112) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 140)] = (((1 &lt;= ((((int)threadIdx.x) + 5) % 9)) &amp;&amp; (((((int)threadIdx.x) + 5) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 140) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 168)] = (((((1 &lt;= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 6) % 9))) &amp;&amp; (((((int)threadIdx.x) + 6) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 168) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 196)] = ((((1 &lt;= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 7) % 9))) &amp;&amp; (((((int)threadIdx.x) + 7) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 196) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 224)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 8) % 9))) &amp;&amp; (((((int)threadIdx.x) + 8) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 224) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 252)] = ((((1 &lt;= ((((int)threadIdx.x) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= (((int)threadIdx.x) % 9))) &amp;&amp; ((((int)threadIdx.x) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 188)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 280)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 1) % 9))) &amp;&amp; (((((int)threadIdx.x) + 1) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 280) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 308)] = (((((1 &lt;= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 2) % 9))) &amp;&amp; (((((int)threadIdx.x) + 2) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 308) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 336)] = (((1 &lt;= ((((int)threadIdx.x) + 3) % 9)) &amp;&amp; (((((int)threadIdx.x) + 3) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 336) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 364)] = (((((1 &lt;= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 4) % 9))) &amp;&amp; (((((int)threadIdx.x) + 4) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 364) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 392)] = (((1 &lt;= ((((int)threadIdx.x) + 5) % 9)) &amp;&amp; (((((int)threadIdx.x) + 5) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 420)] = (((((1 &lt;= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 6) % 9))) &amp;&amp; (((((int)threadIdx.x) + 6) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 420) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 448)] = ((((1 &lt;= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 7) % 9))) &amp;&amp; (((((int)threadIdx.x) + 7) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 448) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 476)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 8) % 9))) &amp;&amp; (((((int)threadIdx.x) + 8) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 476) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 504)] = ((((1 &lt;= ((((int)threadIdx.x) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= (((int)threadIdx.x) % 9))) &amp;&amp; ((((int)threadIdx.x) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 384)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 532)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 1) % 9))) &amp;&amp; (((((int)threadIdx.x) + 1) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 532) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 560)] = (((((1 &lt;= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 2) % 9))) &amp;&amp; (((((int)threadIdx.x) + 2) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 560) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 588)] = (((1 &lt;= ((((int)threadIdx.x) + 3) % 9)) &amp;&amp; (((((int)threadIdx.x) + 3) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 588) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 616)] = (((((1 &lt;= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 4) % 9))) &amp;&amp; (((((int)threadIdx.x) + 4) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 616) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 644)] = (((1 &lt;= ((((int)threadIdx.x) + 5) % 9)) &amp;&amp; (((((int)threadIdx.x) + 5) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 644) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 672)] = (((((1 &lt;= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 6) % 9))) &amp;&amp; (((((int)threadIdx.x) + 6) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 672) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 700)] = ((((1 &lt;= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 7) % 9))) &amp;&amp; (((((int)threadIdx.x) + 7) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 700) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 728)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 8) % 9))) &amp;&amp; (((((int)threadIdx.x) + 8) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 728) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 756)] = ((((1 &lt;= ((((int)threadIdx.x) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= (((int)threadIdx.x) % 9))) &amp;&amp; ((((int)threadIdx.x) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 9) * 7)) + (ry_outer_outer * 7)) + (((int)threadIdx.x) % 9)) + 580)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((((((int)threadIdx.x) + 28) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 1) % 9))) &amp;&amp; (((((int)threadIdx.x) + 1) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 812)] = (((((1 &lt;= ((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 56) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 2) % 9))) &amp;&amp; (((((int)threadIdx.x) + 2) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 812) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 840)] = (((1 &lt;= ((((int)threadIdx.x) + 3) % 9)) &amp;&amp; (((((int)threadIdx.x) + 3) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 840) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 868)] = (((((1 &lt;= ((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 49) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 4) % 9))) &amp;&amp; (((((int)threadIdx.x) + 4) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 868) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 896)] = (((1 &lt;= ((((int)threadIdx.x) + 5) % 9)) &amp;&amp; (((((int)threadIdx.x) + 5) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 896) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 924)] = (((((1 &lt;= ((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer)) &amp;&amp; (((((((int)threadIdx.x) + 42) % 63) / 9) + ry_outer_outer) &lt; 8)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 6) % 9))) &amp;&amp; (((((int)threadIdx.x) + 6) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 924) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 952)] = ((((1 &lt;= (((((int)threadIdx.x) + 7) / 9) + ry_outer_outer)) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 7) % 9))) &amp;&amp; (((((int)threadIdx.x) + 7) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 952) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
+      pad_temp_shared[(((int)threadIdx.x) + 980)] = (((((((((int)threadIdx.x) + 35) / 9) + ry_outer_outer) &lt; 8) &amp;&amp; (1 &lt;= ((((int)threadIdx.x) + 8) % 9))) &amp;&amp; (((((int)threadIdx.x) + 8) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 980) / 9) * 7)) + (ry_outer_outer * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
+      kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 73728) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 28)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 28) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 28) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 56)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 56) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 84)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 84) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 12) &amp; 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 112)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 112) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 140)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 140) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 44) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 168)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 168) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 8) &amp; 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 196)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 196) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 4) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 224)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 224) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 32) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 252)] = kernel[(((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 252) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36)];
+      kernel_shared[(((int)threadIdx.x) + 280)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 280) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 40) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 308)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 308) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 20) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 336)] = kernel[((((((((int)blockIdx.x) * 73728) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 32256)];
+      kernel_shared[(((int)threadIdx.x) + 364)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 364) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 28) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 420)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 420) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 12) &amp; 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 448)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 448) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 16) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 476)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 476) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 44) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 504)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 504) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 8) &amp; 15) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 532)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 532) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 4) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 560)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 560) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 32) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 588)] = kernel[(((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 588) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 36)];
+      kernel_shared[(((int)threadIdx.x) + 616)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 616) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 40) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 644)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 644) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 20) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 672)] = kernel[((((((((int)blockIdx.x) * 73728) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 64512)];
+      kernel_shared[(((int)threadIdx.x) + 700)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 700) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 28) % 48) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 1) % 3))];
+      kernel_shared[(((int)threadIdx.x) + 728)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 728) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 8) / 3) * 9)) + (ry_outer_outer * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+      if (((int)threadIdx.x) &lt; 12) {
+        kernel_shared[(((int)threadIdx.x) + 756)] = kernel[(((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 756) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) / 3) * 9)) + (ry_outer_outer * 3)) + (((int)threadIdx.x) % 3)) + 108)];
       }
-    }
-    for (int ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 = 0; ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 &lt; 9; ++ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1) {
-      *(float2*)(kernel_shared + ((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 * 64) + (((int)threadIdx.x) * 2))) = *(float2*)(kernel + (((((((int)blockIdx.x) / 7) * 147456) + ((((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 * 64) + (((int)threadIdx.x) * 2)) / 18) * 4608)) + (rc_outer_outer * 18)) + (((ax0_ax1_fused_ax2_fused_ax3_fused_outer_outer1 * 10) + (((int)threadIdx.x) * 2)) % 18)));
-    }
-    __syncthreads();
-    for (int rc_inner = 0; rc_inner &lt; 2; ++rc_inner) {
-      for (int ry_inner = 0; ry_inner &lt; 3; ++ry_inner) {
-        for (int rx_inner = 0; rx_inner &lt; 3; ++rx_inner) {
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_inner * 27) + (ry_inner * 9)) + rx_inner)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 1)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-          conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 2)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-          conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 3)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-          conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 4)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-          conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 5)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-          conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((((rc_inner * 27) + (ry_inner * 9)) + rx_inner) + 6)] * kernel_shared[((((((int)threadIdx.x) * 18) + (rc_inner * 9)) + (ry_inner * 3)) + rx_inner)]));
-        }
+      __syncthreads();
+      for (int rc_outer_inner = 0; rc_outer_inner &lt; 16; ++rc_outer_inner) {
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[(((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3))]));
+        conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 48)]));
+        conv2d_nchw[14] = (conv2d_nchw[14] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[21] = (conv2d_nchw[21] + (pad_temp_shared[((rc_outer_inner * 63) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[15] = (conv2d_nchw[15] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[22] = (conv2d_nchw[22] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[16] = (conv2d_nchw[16] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[23] = (conv2d_nchw[23] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[17] = (conv2d_nchw[17] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[24] = (conv2d_nchw[24] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 27)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[18] = (conv2d_nchw[18] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[25] = (conv2d_nchw[25] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 36)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[19] = (conv2d_nchw[19] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[26] = (conv2d_nchw[26] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 45)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[20] = (conv2d_nchw[20] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 96)]));
+        conv2d_nchw[27] = (conv2d_nchw[27] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 54)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 144)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 1)]));
+        conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 49)]));
+        conv2d_nchw[14] = (conv2d_nchw[14] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[21] = (conv2d_nchw[21] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[15] = (conv2d_nchw[15] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[22] = (conv2d_nchw[22] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[16] = (conv2d_nchw[16] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[23] = (conv2d_nchw[23] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[17] = (conv2d_nchw[17] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[24] = (conv2d_nchw[24] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 28)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[18] = (conv2d_nchw[18] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[25] = (conv2d_nchw[25] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 37)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[19] = (conv2d_nchw[19] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[26] = (conv2d_nchw[26] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 46)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[20] = (conv2d_nchw[20] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 97)]));
+        conv2d_nchw[27] = (conv2d_nchw[27] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 55)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 145)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[8] = (conv2d_nchw[8] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[9] = (conv2d_nchw[9] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[10] = (conv2d_nchw[10] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[11] = (conv2d_nchw[11] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[12] = (conv2d_nchw[12] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 2)]));
+        conv2d_nchw[13] = (conv2d_nchw[13] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 50)]));
+        conv2d_nchw[14] = (conv2d_nchw[14] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[21] = (conv2d_nchw[21] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+        conv2d_nchw[15] = (conv2d_nchw[15] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[22] = (conv2d_nchw[22] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+        conv2d_nchw[16] = (conv2d_nchw[16] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[23] = (conv2d_nchw[23] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+        conv2d_nchw[17] = (conv2d_nchw[17] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[24] = (conv2d_nchw[24] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 29)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+        conv2d_nchw[18] = (conv2d_nchw[18] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[25] = (conv2d_nchw[25] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 38)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+        conv2d_nchw[19] = (conv2d_nchw[19] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[26] = (conv2d_nchw[26] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 47)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
+        conv2d_nchw[20] = (conv2d_nchw[20] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 98)]));
+        conv2d_nchw[27] = (conv2d_nchw[27] + (pad_temp_shared[(((rc_outer_inner * 63) + (((int)threadIdx.x) % 7)) + 56)] * kernel_shared[((((((int)threadIdx.x) / 7) * 192) + (rc_outer_inner * 3)) + 146)]));
       }
     }
   }
-  compute[((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7))] = max((conv2d_nchw[0] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-  compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 1)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-  compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 2)] = max((conv2d_nchw[2] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-  compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 3)] = max((conv2d_nchw[3] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-  compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 4)] = max((conv2d_nchw[4] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-  compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 5)] = max((conv2d_nchw[5] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
-  compute[(((((((int)blockIdx.x) / 7) * 1568) + (((int)threadIdx.x) * 49)) + ((((int)blockIdx.x) % 7) * 7)) + 6)] = max((conv2d_nchw[6] + bias[(((((int)blockIdx.x) / 7) * 32) + ((int)threadIdx.x))]), 0.000000e+00f);
+  for (int i1_inner = 0; i1_inner &lt; 4; ++i1_inner) {
+    for (int i2_inner = 0; i2_inner &lt; 7; ++i2_inner) {
+      compute[(((((((int)blockIdx.x) * 784) + ((((int)threadIdx.x) / 7) * 196)) + (i1_inner * 49)) + (i2_inner * 7)) + (((int)threadIdx.x) % 7))] = max((conv2d_nchw[((i1_inner * 7) + i2_inner)] + bias[(((((int)blockIdx.x) * 16) + ((((int)threadIdx.x) / 7) * 4)) + i1_inner)]), 0.000000e+00f);
+    }
+  }
 }
 </pre></div>
 </div>
@@ -752,7 +1118,7 @@ In the example below we resume the search status and run 5 more trials.</p>
 Get devices for measurement successfully!
 </pre></div>
 </div>
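For reference, resuming a search like this in auto_scheduler amounts to reloading the cost model and the already-measured states from the log. A minimal sketch, assuming `task` and `log_file` are already defined as in the tutorial:

    from tvm import auto_scheduler

    # Rebuild the cost model from previously measured records.
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)

    # Preload measured states so already-tried schedules are not re-measured.
    search_policy = auto_scheduler.SketchPolicy(
        task,
        program_cost_model=cost_model,
        init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
    )

    # Run 5 more trials, appending the new records to the same log file.
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    task.tune(tune_option, search_policy=search_policy)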
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  46.458 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  57.713 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_conv2d_layer_cuda.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
index 3c99febf3..50556c35d 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
@@ -901,7 +901,7 @@ so we can read the log file and load the best schedules.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-   9.8330       9.8493       9.8599       9.7896       0.0310
+   9.7912       9.7954       9.8093       9.7689       0.0167
 </pre></div>
 </div>
 </div>
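Reading the log file and applying the best schedules during compilation follows the usual pattern. A minimal sketch, assuming `mod`, `params`, `target`, and `log_file` come from the earlier tuning steps of the tutorial:

    import tvm
    from tvm import auto_scheduler, relay

    # Apply the history-best records from the log while building the network.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=target, params=params)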
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
index d2f776c5b..fd7737f46 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
@@ -920,7 +920,7 @@ so we can read the log file and load the best schedules.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  761.5104     760.6213     764.0922     759.8176      1.8549
+  759.4694     759.6805     760.8415     757.8861      1.2157
 </pre></div>
 </div>
 </div>
@@ -942,7 +942,7 @@ to learn how to use the RPC Tracker and RPC Server.
 To use the RPC Tracker in auto-scheduler, replace the runner in <code class="code docutils literal notranslate"><span class="pre">TuningOptions</span></code>
 with <a class="reference internal" href="../../reference/api/python/auto_scheduler.html#tvm.auto_scheduler.RPCRunner" title="tvm.auto_scheduler.RPCRunner"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.RPCRunner</span></code></a>.</p></li>
 </ol>
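A minimal sketch of that substitution, with a hypothetical device key and tracker address standing in for a real RPC setup (assumes `log_file` is defined):

    from tvm import auto_scheduler

    runner = auto_scheduler.RPCRunner(
        "rasp4b",          # device key registered with the tracker (placeholder)
        host="127.0.0.1",  # RPC tracker host (placeholder)
        port=9190,         # RPC tracker port (placeholder)
        repeat=1,
        min_repeat_ms=200,
    )
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,
        runner=runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )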
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  20.837 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  20.722 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-network-x86-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_network_x86.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
index 044d51e17..a54a1a0b6 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
@@ -620,104 +620,32 @@ layout transformation, parallelization, vectorization, unrolling, and operator fusion
              placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
              compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
   buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
-  preflattened_buffer_map = {compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_7: placeholder_15: Buffer(placeholder_12, int32, [4916], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), placeholder_6: placeholder_17: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_8: placeholder_18: Buffer(placeholder_13, int32, [33], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
-  for (i0.outer.i1.outer.fused: int32, 0, 64) &quot;parallel&quot; {
-    allocate(compute_4: Pointer(global float32), float32, [1024]), storage_scope = global {
-      for (i.inner.init: int32, 0, 64) {
-        let cse_var_1: int32 = (i.inner.init*16)
-         {
-          compute_5: Buffer(compute_4, float32, [1024], [])[cse_var_1] = 0f32
-          compute_5[(cse_var_1 + 1)] = 0f32
-          compute_5[(cse_var_1 + 2)] = 0f32
-          compute_5[(cse_var_1 + 3)] = 0f32
-          compute_5[(cse_var_1 + 4)] = 0f32
-          compute_5[(cse_var_1 + 5)] = 0f32
-          compute_5[(cse_var_1 + 6)] = 0f32
-          compute_5[(cse_var_1 + 7)] = 0f32
-          compute_5[(cse_var_1 + 8)] = 0f32
-          compute_5[(cse_var_1 + 9)] = 0f32
-          compute_5[(cse_var_1 + 10)] = 0f32
-          compute_5[(cse_var_1 + 11)] = 0f32
-          compute_5[(cse_var_1 + 12)] = 0f32
-          compute_5[(cse_var_1 + 13)] = 0f32
-          compute_5[(cse_var_1 + 14)] = 0f32
-          compute_5[(cse_var_1 + 15)] = 0f32
-        }
-      }
-      for (elem_idx: int32, 0, let cse_var_2: int32 = floormod(i0.outer.i1.outer.fused, 32) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
-        for (i.inner: int32, 0, 64) {
-          let cse_var_3: int32 = floormod(i0.outer.i1.outer.fused, 32)
-           {
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_4: int32 = (i.inner*16)
-              compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[((placeholder_3[cse_var_3]*16) + (elem_idx*16))]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_5: int32 = ((i.inner*16) + 1)
-              compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 1)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_6: int32 = ((i.inner*16) + 2)
-              compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 2)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_7: int32 = ((i.inner*16) + 3)
-              compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 3)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_8: int32 = ((i.inner*16) + 4)
-              compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 4)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_9: int32 = ((i.inner*16) + 5)
-              compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 5)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_10: int32 = ((i.inner*16) + 6)
-              compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 6)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_11: int32 = ((i.inner*16) + 7)
-              compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 7)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+  preflattened_buffer_map = {placeholder_5: placeholder_15: Buffer(placeholder_10, float32, [128, 256], []), placeholder_8: placeholder_16: Buffer(placeholder_13, int32, [33], []), placeholder_7: placeholder_17: Buffer(placeholder_12, int32, [4916], []), placeholder_6: placeholder_18: Buffer(placeholder_11, float32, [4916, 16, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
+  for (i0.outer.i1.outer.fused: int32, 0, 16) &quot;parallel&quot; {
+    allocate(compute_4: Pointer(global float32), float32, [4096]), storage_scope = global {
+      for (i.outer.inner: int32, 0, 8) {
+        for (nb_j.inner: int32, 0, 2) {
+          for (i.inner.init: int32, 0, 16) {
+            for (j.init: int32, 0, 16) {
+              compute_5: Buffer(compute_4, float32, [4096], [])[((((i.outer.inner*512) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
             }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_12: int32 = ((i.inner*16) + 8)
-              compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 8)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_13: int32 = ((i.inner*16) + 9)
-              compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 9)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_14: int32 = ((i.inner*16) + 10)
-              compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 10)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_15: int32 = ((i.inner*16) + 11)
-              compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 11)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_16: int32 = ((i.inner*16) + 12)
-              compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 12)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_17: int32 = ((i.inner*16) + 13)
-              compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 13)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_18: int32 = ((i.inner*16) + 14)
-              compute_5[cse_var_18] = (compute_5[cse_var_18] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 14)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
-            }
-            if @tir.likely((elem_idx &lt; (placeholder_3[(cse_var_3 + 1)] - placeholder_3[cse_var_3])), dtype=bool) {
-              let cse_var_19: int32 = ((i.inner*16) + 15)
-              compute_5[cse_var_19] = (compute_5[cse_var_19] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + 15)]*max(placeholder[(((floordiv(i0.outer.i1.outer.fused, 32)*16384) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+          }
+          for (elem_idx: int32, 0, let cse_var_1: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
+            for (i.inner: int32, 0, 16) {
+              for (j: int32, 0, 16) {
+                let cse_var_3: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
+                let cse_var_2: int32 = ((((i.outer.inner*512) + (i.inner*32)) + (nb_j.inner*16)) + j)
+                compute_5[cse_var_2] = (compute_5[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[(((i.outer.inner*4096) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+              }
             }
           }
         }
       }
-      for (i0.inner: int32, 0, 64) {
-        let cse_var_20: int32 = (((floordiv(i0.outer.i1.outer.fused, 32)*32768) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 32)*16))
-        compute[ramp(cse_var_20, 1, 16)] = max((compute_5[ramp((i0.inner*16), 1, 16)] + placeholder_4[ramp(cse_var_20, 1, 16)]), broadcast(0f32, 16))
+      for (i0.inner: int32, 0, 128) {
+        for (i1.inner: int32, 0, 32) {
+          let cse_var_4: int32 = (((i0.inner*512) + (i0.outer.i1.outer.fused*32)) + i1.inner)
+          compute[cse_var_4] = max((compute_5[((i0.inner*32) + i1.inner)] + placeholder_4[cse_var_4]), 0f32)
+        }
       }
     }
   }
@@ -755,7 +683,7 @@ layout transformation, parallelization, vectorization, unrolling, and operator fusion
 <span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.849 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.517 ms
 </pre></div>
 </div>
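The timing figure above is produced with TVM's time evaluator. A minimal sketch of how such a measurement is typically taken, assuming `func` is the built operator and `dev`/`args` match the workload:

    import numpy as np

    # Report the median of repeated runs, with a minimum repeat window
    # per measurement to stabilize the timing.
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    print(
        "Execution time of this operator: %.3f ms"
        % (np.median(evaluator(*args).results) * 1000)
    )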
 <div class="admonition note">
diff --git a/docs/how_to/tune_with_autotvm/sg_execution_times.html b/docs/how_to/tune_with_autotvm/sg_execution_times.html
index 2120ab931..231ed960a 100644
--- a/docs/how_to/tune_with_autotvm/sg_execution_times.html
+++ b/docs/how_to/tune_with_autotvm/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-tune-with-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:43.518</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
+<p><strong>00:44.271</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -331,11 +331,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></td>
-<td><p>00:43.483</p></td>
+<td><p>00:44.240</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_relay_x86.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a Convolutional Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></td>
-<td><p>00:00.021</p></td>
+<td><p>00:00.016</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_relay_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a Convolutional Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></td>
diff --git a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
index 27072171c..13de9a2bd 100644
--- a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
+++ b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
@@ -1167,8 +1167,8 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 4, 4, 32]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 1, 128]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2885496
-No: 6   GFLOPS: 42.43/42.43     result: MeasureResult(costs=(0.005456700842105263,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6141278743743896, timestamp=1656635456.932159)        [(&#39;tile_f&#39;, [-1, 1, 1, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 4, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,3754080
-No: 7   GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 6   GFLOPS: 109.70/109.70   result: MeasureResult(costs=(0.0021103945208333333,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6624927520751953, timestamp=1656695361.7349436)      [(&#39;tile_f&#39;, [-1, 1, 1, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 4, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,3754080
+No: 7   GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1291,7 +1291,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 1, 16, 32]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 256, 1]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6225319
-No: 8   GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 8   GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1414,7 +1414,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 1, 32]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 8, 64]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,943546
-No: 9   GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 9   GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1537,7 +1537,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 4, 16, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 16, 32]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2868708
-No: 10  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 10  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 142, in build
     res = future.result()
   File &quot;/usr/lib/python3.7/concurrent/futures/_base.py&quot;, line 435, in result
@@ -1555,7 +1555,7 @@ No: 10  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
 TimeoutError
 
         [(&#39;tile_f&#39;, [-1, 32, 2, 4]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 4, 2]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4691833
-No: 11  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 11  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1678,7 +1678,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 1, 2, 64]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 4]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,1042124
-No: 12  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 12  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1801,7 +1801,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 32, 1, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 32, 16]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,10013405
-No: 13  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 13  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1924,7 +1924,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 8, 8, 2]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 4, 32]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6732082
-No: 14  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 14  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2047,7 +2047,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 4, 32]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 128]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 1)],None,7536735
-No: 15  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 15  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2170,7 +2170,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 1, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 128, 4]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,482121
-No: 16  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 16  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2293,7 +2293,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 2, 1, 16]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 32, 8]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2824525
-No: 17  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 17  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2416,7 +2416,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 64, 1, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 8, 8]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4559286
-No: 18  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 18  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2539,7 +2539,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 1, 32, 16]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 512]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9677544
-No: 19  GFLOPS: 0.00/42.43      result: Traceback (most recent call last):
+No: 19  GFLOPS: 0.00/109.70     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 738, in __call__
     yield remote, remote.load_module(os.path.split(build_result.filename)[1])
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 702, in run_through_rpc
@@ -2627,7 +2627,7 @@ tvm._ffi.base.TVMError: Traceback (most recent call last):
   15: _PyEval_EvalFrameDefault
   14: 0x0000000000537c30
   13: _PyObject_FastCallKeywords
-  12: 0x00007f03bd9f3fa2
+  12: 0x00007f937a5adfa2
   11: _ctypes_callproc
   10: ffi_call
   9: ffi_call_unix64
@@ -2692,7 +2692,7 @@ Traceback (most recent call last):
   21: _PyFunction_FastCallKeywords
   20: _PyEval_EvalFrameDefault
   19: _PyFunction_FastCall      [(&#39;tile_f&#39;, [-1, 8, 2, 16]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6390073
-No: 20  GFLOPS: 143.77/143.77   result: MeasureResult(costs=(0.00161018918,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4295034408569336, timestamp=1656635483.5630622)      [(&#39;tile_f&#39;, [-1, 1, 4, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9881539
+No: 20  GFLOPS: 143.57/143.57   result: MeasureResult(costs=(0.00161243103,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.437474250793457, timestamp=1656695388.2929761)       [(&#39;tile_f&#39;, [-1, 1, 4, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9881539
 </pre></div>
 </div>
 <p>Finally we can inspect the best config from the log file, check correctness,
@@ -2733,7 +2733,7 @@ and measure running time.</p>
 Best config:
 [(&#39;tile_f&#39;, [-1, 1, 4, 1]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,9881539
 Finish loading 20 records
-Time cost of this operator: 0.002028
+Time cost of this operator: 0.002060
 </pre></div>
 </div>
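The log above shows the tuner discarding configurations that fail the GPU validity check (the repeated InstantiationError entries) and keeping the best measured record, here No. 20 at roughly 143.6 GFLOPS. A minimal sketch of the replay step that produces the "Time cost of this operator" figure, assuming the tutorial's conv2d template (called conv2d_no_batchnorm here) and the shape variables N, H, W, CO, CI, KH, KW, strides, padding defined earlier in the tutorial:

    import numpy as np
    import tvm
    import tvm.testing
    from tvm import autotvm
    from tvm.topi.testing import conv2d_nchw_python

    # Pick the best record from the tuning log and build with it.
    with autotvm.apply_history_best("conv2d.log"):
        with tvm.target.Target("cuda"):
            s, arg_bufs = conv2d_no_batchnorm(N, H, W, CO, CI, KH, KW, strides, padding)
            func = tvm.build(s, arg_bufs)

    # Check correctness against a NumPy reference implementation.
    a_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
    w_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
    c_np = conv2d_nchw_python(a_np, w_np, strides, padding)

    dev = tvm.cuda()
    a_tvm = tvm.nd.array(a_np, device=dev)
    w_tvm = tvm.nd.array(w_np, device=dev)
    c_tvm = tvm.nd.empty(c_np.shape, device=dev)
    func(a_tvm, w_tvm, c_tvm)
    tvm.testing.assert_allclose(c_np, c_tvm.numpy(), rtol=1e-2)

    # Time the kernel; this prints the figure shown in the output above.
    evaluator = func.time_evaluator(func.entry_name, dev, number=400)
    print("Time cost of this operator: %f" % evaluator(a_tvm, w_tvm, c_tvm).mean)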
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autotvm-tune-conv2d-cuda-py">
diff --git a/docs/how_to/work_with_microtvm/micro_autotune.html b/docs/how_to/work_with_microtvm/micro_autotune.html
index 157e11c22..7b0563ac1 100644
--- a/docs/how_to/work_with_microtvm/micro_autotune.html
+++ b/docs/how_to/work_with_microtvm/micro_autotune.html
@@ -578,10 +578,10 @@ the tuned operator.</p>
 ########## Build without Autotuning ##########
 Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)
 ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  312.9     98.726   (1, 2, 10, 10, 3)  2       1        [312.9]
-tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.074     0.97     (1, 6, 10, 10)     1       1        [3.074]
-tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.963     0.304    (1, 1, 10, 10, 3)  1       1        [0.963]
-Total_time                                    -                                             316.937   -        -                  -       -        -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  312.7     98.733   (1, 2, 10, 10, 3)  2       1        [312.7]
+tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.036     0.959    (1, 6, 10, 10)     1       1        [3.036]
+tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.977     0.309    (1, 1, 10, 10, 3)  1       1        [0.977]
+Total_time                                    -                                             316.713   -        -                  -       -        -
 </pre></div>
 </div>
 </div>
@@ -634,10 +634,10 @@ Total_time                                    -
 ########## Build with Autotuning ##########
 Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)
 ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  79.25     96.609   (1, 6, 10, 10, 1)  2       1        [79.25]
-tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.798     2.191    (1, 6, 10, 10)     1       1        [1.798]
-tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.984     1.199    (1, 1, 10, 10, 3)  1       1        [0.984]
-Total_time                                    -                                             82.031    -        -                  -       -        -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  189.5     98.416   (1, 1, 10, 10, 6)  2       1        [189.5]
+tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       2.202     1.144    (1, 6, 10, 10)     1       1        [2.202]
+tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.847     0.44     (1, 3, 10, 10, 1)  1       1        [0.847]
+Total_time                                    -                                             192.55    -        -                  -       -        -
 </pre></div>
 </div>
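Read against the untuned profile above, the two Total_time rows give the overall effect of autotuning in this run; a quick check of the arithmetic:

    # Totals taken from the two tables above, in microseconds.
    total_untuned = 316.713
    total_tuned = 192.55
    print("Speedup from autotuning: %.2fx" % (total_untuned / total_tuned))  # ~1.64x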
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-autotune-py">
diff --git a/docs/how_to/work_with_microtvm/micro_train.html b/docs/how_to/work_with_microtvm/micro_train.html
index 0a5d4c984..c45e54e80 100644
--- a/docs/how_to/work_with_microtvm/micro_train.html
+++ b/docs/how_to/work_with_microtvm/micro_train.html
@@ -510,7 +510,7 @@ take about <strong>2 minutes</strong> to download the Stanford Cars, while COCO
 <a href="https://docs.python.org/3/library/shutil.html#shutil.move" title="shutil.move" class="sphx-glr-backref-module-shutil sphx-glr-backref-type-py-function"><span class="n">shutil</span><span class="o">.</span><span class="n">move</span></a><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-typ [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&#39;/tmp/tmpdii8xw_t/images/random&#39;
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&#39;/tmp/tmplvgsgyrm/images/random&#39;
 </pre></div>
 </div>
 </div>
@@ -570,8 +570,8 @@ objects to other stuff? We can display some examples from our datasets using <co
     <span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s2">&quot;off&quot;</span><span class="p">)</span>
 </pre></div>
 </div>
-<img src="../../_images/sphx_glr_micro_train_001.png" srcset="../../_images/sphx_glr_micro_train_001.png" alt="[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/tmp/tmpdii8xw_t/images/target contains 8144 images
-/tmp/tmpdii8xw_t/images/random contains 5000 images
+<img src="../../_images/sphx_glr_micro_train_001.png" srcset="../../_images/sphx_glr_micro_train_001.png" alt="[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/tmp/tmplvgsgyrm/images/target contains 8144 images
+/tmp/tmplvgsgyrm/images/random contains 5000 images
 </pre></div>
 </div>
 </div>
@@ -683,13 +683,13 @@ the time on our validation set).</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Epoch 1/3
-328/328 - 55s - loss: 0.2267 - accuracy: 0.9217 - val_loss: 0.1547 - val_accuracy: 0.9520
+328/328 - 55s - loss: 0.2185 - accuracy: 0.9259 - val_loss: 0.1491 - val_accuracy: 0.9588
 Epoch 2/3
-328/328 - 52s - loss: 0.0968 - accuracy: 0.9640 - val_loss: 0.1113 - val_accuracy: 0.9622
+328/328 - 52s - loss: 0.0952 - accuracy: 0.9644 - val_loss: 0.1322 - val_accuracy: 0.9585
 Epoch 3/3
-328/328 - 52s - loss: 0.0648 - accuracy: 0.9764 - val_loss: 0.1232 - val_accuracy: 0.9634
+328/328 - 53s - loss: 0.0676 - accuracy: 0.9751 - val_loss: 0.1259 - val_accuracy: 0.9603
 
-&lt;keras.callbacks.History object at 0x7f9561782710&gt;
+&lt;keras.callbacks.History object at 0x7f74b3d048d0&gt;
 </pre></div>
 </div>
 </div>
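A minimal sketch of the training call behind this log, assuming model is the compiled Keras model and train_dataset/validation_dataset are the datasets prepared earlier in the tutorial (names illustrative):

    # verbose=2 produces the one-line-per-epoch log shown above.
    history = model.fit(
        train_dataset,
        validation_data=validation_dataset,
        epochs=3,
        verbose=2,
    )
    print(history.history["val_accuracy"])  # e.g. [0.9588, 0.9585, 0.9603]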
@@ -951,7 +951,7 @@ as intended.</p>
 <p>From here, we could modify the model to read live images from the camera - we have another
 Arduino tutorial for how to do that <a class="reference external" href="https://github.com/guberti/tvm-arduino-demos/tree/master/examples/person_detection">on GitHub</a>. Alternatively, we could also
 <a class="reference external" href="https://tvm.apache.org/docs/how_to/work_with_microtvm/micro_autotune.html">use TVM’s autotuning capabilities</a> to dramatically improve the model’s performance.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 4 minutes  58.237 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 4 minutes  50.228 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-train-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">micro_train.py</span></code></a></p>
diff --git a/docs/how_to/work_with_microtvm/sg_execution_times.html b/docs/how_to/work_with_microtvm/sg_execution_times.html
index c631a6a0f..ee927f73a 100644
--- a/docs/how_to/work_with_microtvm/sg_execution_times.html
+++ b/docs/how_to/work_with_microtvm/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-microtvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>05:44.793</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
+<p><strong>05:36.577</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -331,15 +331,15 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="micro_train.html#sphx-glr-how-to-work-with-microtvm-micro-train-py"><span class="std std-ref">Training Vision Models for microTVM on Arduino</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_train.py</span></code>)</p></td>
-<td><p>04:58.237</p></td>
+<td><p>04:50.228</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_autotune.py</span></code>)</p></td>
-<td><p>00:43.228</p></td>
+<td><p>00:43.073</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></td>
-<td><p>00:03.326</p></td>
+<td><p>00:03.274</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></td>
diff --git a/docs/how_to/work_with_relay/sg_execution_times.html b/docs/how_to/work_with_relay/sg_execution_times.html
index add75c6b8..d5e03266d 100644
--- a/docs/how_to/work_with_relay/sg_execution_times.html
+++ b/docs/how_to/work_with_relay/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-relay-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:11.718</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
+<p><strong>00:11.371</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -331,11 +331,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="using_external_lib.html#sphx-glr-how-to-work-with-relay-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></td>
-<td><p>00:09.999</p></td>
+<td><p>00:09.841</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="build_gcn.html#sphx-glr-how-to-work-with-relay-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></td>
-<td><p>00:01.713</p></td>
+<td><p>00:01.524</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="using_relay_viz.html#sphx-glr-how-to-work-with-relay-using-relay-viz-py"><span class="std std-ref">Use Relay Visualizer to Visualize Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_relay_viz.py</span></code>)</p></td>
diff --git a/docs/how_to/work_with_schedules/intrin_math.html b/docs/how_to/work_with_schedules/intrin_math.html
index 598cefe19..cb35c0be7 100644
--- a/docs/how_to/work_with_schedules/intrin_math.html
+++ b/docs/how_to/work_with_schedules/intrin_math.html
@@ -517,7 +517,7 @@ The following example customizes CUDA lowering rule for <code class="code docuti
 <a href="../../reference/api/python/ir.html#tvm.ir.register_intrin_lowering" title="tvm.ir.register_intrin_lowering" class="sphx-glr-backref-module-tvm-ir sphx-glr-backref-type-py-function"><span class="n">register_intrin_lowering</span></a><span class="p">(</span><span class="s2">&quot;tir.exp&quot;</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="s2">&quot;cuda&quot;</span><span class="p">,</span> <span class="n">f</span><span class="o">= [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&lt;function my_cuda_math_rule at 0x7f94eb6f1cb0&gt;
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&lt;function my_cuda_math_rule at 0x7f742b2438c0&gt;
 </pre></div>
 </div>
 <p>Register the rule with TVM, using the override option to replace the existing rule.
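A sketch of what that registration looks like, assuming the tutorial's my_cuda_math_rule dispatches float32 calls to the single-precision C function (e.g. expf for tir.exp):

    import tvm
    from tvm.ir import register_intrin_lowering

    def my_cuda_math_rule(op):
        # Strip the "tir." prefix and call the plain C math function;
        # float32 gets the "f"-suffixed variant, everything else is left as is.
        name = op.op.name[len("tir."):]
        if op.dtype == "float32":
            return tvm.tir.call_pure_extern("float32", "%sf" % name, op.args[0])
        return op

    register_intrin_lowering("tir.exp", target="cuda", f=my_cuda_math_rule, level=99)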
diff --git a/docs/how_to/work_with_schedules/sg_execution_times.html b/docs/how_to/work_with_schedules/sg_execution_times.html
index ba5685118..6686ae9c1 100644
--- a/docs/how_to/work_with_schedules/sg_execution_times.html
+++ b/docs/how_to/work_with_schedules/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-schedules-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:04.294</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
+<p><strong>00:04.087</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -331,19 +331,19 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="intrin_math.html#sphx-glr-how-to-work-with-schedules-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></td>
-<td><p>00:01.966</p></td>
+<td><p>00:01.891</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tensorize.html#sphx-glr-how-to-work-with-schedules-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></td>
-<td><p>00:01.075</p></td>
+<td><p>00:00.976</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></td>
-<td><p>00:00.550</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></td>
+<td><p>00:00.526</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></td>
-<td><p>00:00.529</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></td>
+<td><p>00:00.520</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="extern_op.html#sphx-glr-how-to-work-with-schedules-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></td>
@@ -355,11 +355,11 @@
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tedd.html#sphx-glr-how-to-work-with-schedules-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></td>
-<td><p>00:00.027</p></td>
+<td><p>00:00.026</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tuple_inputs.html#sphx-glr-how-to-work-with-schedules-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></td>
-<td><p>00:00.014</p></td>
+<td><p>00:00.013</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/work_with_schedules/tensorize.html b/docs/how_to/work_with_schedules/tensorize.html
index dfa41df7e..9ace734be 100644
--- a/docs/how_to/work_with_schedules/tensorize.html
+++ b/docs/how_to/work_with_schedules/tensorize.html
@@ -572,7 +572,7 @@ The importing needs to happen before the tensorized GEMV is executed.</p>
              C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C}
   preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
-  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmpe_ar4cnv/input0.cc&#39;\nsource_filename = \&quot;/tmp/tmpe_ar4cnv/input0.cc\&quot;\ntarget datalayout = \&quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128\&quot;\ntarget triple = \&quot;x86_64-pc-linux-gnu\&quot;\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = allo [...]
+  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmpv2vuyfy1/input0.cc&#39;\nsource_filename = \&quot;/tmp/tmpv2vuyfy1/input0.cc\&quot;\ntarget datalayout = \&quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128\&quot;\ntarget triple = \&quot;x86_64-pc-linux-gnu\&quot;\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = allo [...]
   for (i, 0, 1024) {
     for (j.outer: int32, 0, 32) {
       @tir.call_extern(&quot;gemv_update&quot;, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
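The pragma_import_llvm attribute above carries LLVM IR compiled from the tutorial's gemv_update C source. A sketch of how it gets attached, assuming cc_code holds that C source and s, C, x are the schedule, output tensor, and outer axis from the tutorial:

    from tvm.contrib import clang

    # Compile the C implementation to LLVM IR and attach it to the schedule,
    # so the tensorized body can call gemv_update via call_extern.
    ll_code = clang.create_llvm(cc_code, output="gemv.ll")
    s[C].pragma(x, "import_llvm", ll_code)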
diff --git a/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor-members.html b/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor-members.html
index 9d0baa253..baec75953 100644
--- a/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor-members.html
+++ b/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor-members.html
@@ -85,7 +85,7 @@ $(function() {
   <tr><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a3deeeac5827a88f375b8c6ae1039c219">operator-&gt;</a>() const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
   <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#a4744bf4a1b48f202d41b51dc5e08e6ee">operator&lt;</a>(const ObjectRef &amp;other) const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
   <tr><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#affdf1b8cdb36e140de7b3ad7064e4617">operator==</a>(const ObjectRef &amp;other) const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
-  <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a70128f6b43830ee2a9a421efec900ba1">PerStoreFeature</a>(int buffers_per_store=5, int arith_intensity_curve_num_samples=10, int cache_line_bytes=64)</td><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">tvm::meta_schedule::FeatureExtractor</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
+  <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#abbfc563425a975e026f2daf3bbfa86ee">PerStoreFeature</a>(int buffers_per_store=5, int arith_intensity_curve_num_samples=10, int cache_line_bytes=64, bool extract_workload=false)</td><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">tvm::meta_schedule::FeatureExtractor</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
   <tr><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#ac4b355e78ec150c5d067f78638f2da82">PyFeatureExtractor</a>(PyFeatureExtractorNode::FExtractFrom f_extract_from, PyFeatureExtractorNode::FAsString f_as_string)</td><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">tvm::meta_schedule::FeatureExtractor</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
   <tr class="even"><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html#ae31a5b9f40781d60a2901994ead700e8">same_as</a>(const ObjectRef &amp;other) const</td><td class="entry"><a class="el" href="classtvm_1_1runtime_1_1ObjectRef.html">tvm::runtime::ObjectRef</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
   <tr><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a520ea139db1d08dccc22e323bb1742e8">TVM_DEFINE_MUTABLE_OBJECT_REF_METHODS</a>(FeatureExtractor, ObjectRef, FeatureExtractorNode)</td><td class="entry"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">tvm::meta_schedule::FeatureExtractor</a></td><td class="entry"></td></tr>
diff --git a/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor.html b/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor.html
index 49e225a08..37993b433 100644
--- a/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor.html
+++ b/docs/reference/api/doxygen/classtvm_1_1meta__schedule_1_1FeatureExtractor.html
@@ -128,9 +128,9 @@ Public Member Functions</h2></td></tr>
 </table><table class="memberdecls">
 <tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="pub-static-methods"></a>
 Static Public Member Functions</h2></td></tr>
-<tr class="memitem:a70128f6b43830ee2a9a421efec900ba1"><td class="memItemLeft" align="right" valign="top">static <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">FeatureExtractor</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a70128f6b43830ee2a9a421efec900ba1">PerStoreFeature</a> (int buffers_per_store=5, int arith_intensity_curve_num_samples=10, int cache_line_bytes=64)</td></tr>
-<tr class="memdesc:a70128f6b43830ee2a9a421efec900ba1"><td class="mdescLeft">&#160;</td><td class="mdescRight">Create a feature extractor that extracts features from each BufferStore.  <a href="#a70128f6b43830ee2a9a421efec900ba1">More...</a><br /></td></tr>
-<tr class="separator:a70128f6b43830ee2a9a421efec900ba1"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:abbfc563425a975e026f2daf3bbfa86ee"><td class="memItemLeft" align="right" valign="top">static <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">FeatureExtractor</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#abbfc563425a975e026f2daf3bbfa86ee">PerStoreFeature</a> (int buffers_per_store=5, int arith_intensity_curve_num_samples=10, int cache_line_bytes=64, bool extract_wo [...]
+<tr class="memdesc:abbfc563425a975e026f2daf3bbfa86ee"><td class="mdescLeft">&#160;</td><td class="mdescRight">Create a feature extractor that extracts features from each BufferStore.  <a href="#abbfc563425a975e026f2daf3bbfa86ee">More...</a><br /></td></tr>
+<tr class="separator:abbfc563425a975e026f2daf3bbfa86ee"><td class="memSeparator" colspan="2">&#160;</td></tr>
 <tr class="memitem:ac4b355e78ec150c5d067f78638f2da82"><td class="memItemLeft" align="right" valign="top">static <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html">FeatureExtractor</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#ac4b355e78ec150c5d067f78638f2da82">PyFeatureExtractor</a> (<a class="el" href="classtvm_1_1meta__schedule_1_1PyFeatureExtractorNode.html#a2c3241bd5c792cb6dab434789 [...]
 <tr class="memdesc:ac4b355e78ec150c5d067f78638f2da82"><td class="mdescLeft">&#160;</td><td class="mdescRight">Create a feature extractor with customized methods on the python-side.  <a href="#ac4b355e78ec150c5d067f78638f2da82">More...</a><br /></td></tr>
 <tr class="separator:ac4b355e78ec150c5d067f78638f2da82"><td class="memSeparator" colspan="2">&#160;</td></tr>
@@ -168,8 +168,8 @@ Additional Inherited Members</h2></td></tr>
 <div class="textblock"><p>Managed reference to <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractorNode.html" title="Extractor for features from measure candidates for use in cost model. ">FeatureExtractorNode</a>. </p>
 <dl class="section see"><dt>See also</dt><dd><a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractorNode.html" title="Extractor for features from measure candidates for use in cost model. ">FeatureExtractorNode</a> </dd></dl>
 </div><h2 class="groupheader">Member Function Documentation</h2>
-<a id="a70128f6b43830ee2a9a421efec900ba1"></a>
-<h2 class="memtitle"><span class="permalink"><a href="#a70128f6b43830ee2a9a421efec900ba1">&#9670;&nbsp;</a></span>PerStoreFeature()</h2>
+<a id="abbfc563425a975e026f2daf3bbfa86ee"></a>
+<h2 class="memtitle"><span class="permalink"><a href="#abbfc563425a975e026f2daf3bbfa86ee">&#9670;&nbsp;</a></span>PerStoreFeature()</h2>
 
 <div class="memitem">
 <div class="memproto">
@@ -193,7 +193,13 @@ Additional Inherited Members</h2></td></tr>
           <td class="paramkey"></td>
           <td></td>
           <td class="paramtype">int&#160;</td>
-          <td class="paramname"><em>cache_line_bytes</em> = <code>64</code>&#160;</td>
+          <td class="paramname"><em>cache_line_bytes</em> = <code>64</code>, </td>
+        </tr>
+        <tr>
+          <td class="paramkey"></td>
+          <td></td>
+          <td class="paramtype">bool&#160;</td>
+          <td class="paramname"><em>extract_workload</em> = <code>false</code>&#160;</td>
         </tr>
         <tr>
           <td></td>
@@ -214,6 +220,7 @@ Additional Inherited Members</h2></td></tr>
     <tr><td class="paramname">buffers_per_store</td><td>The number of buffers in each BufferStore; Pad or truncate if necessary. </td></tr>
     <tr><td class="paramname">arith_intensity_curve_num_samples</td><td>The number of samples used in the arithmetic intensity curve. </td></tr>
     <tr><td class="paramname">cache_line_bytes</td><td>The number of bytes in a cache line. </td></tr>
+    <tr><td class="paramname">extract_workload</td><td>Whether to extract features in the workload in tuning context or not. </td></tr>
   </table>
   </dd>
 </dl>
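A sketch of constructing the extractor with the new flag from Python, assuming the Python binding mirrors the C++ signature documented above:

    from tvm.meta_schedule.feature_extractor import PerStoreFeature

    # extract_workload is the parameter added here; the values shown are the
    # documented defaults.
    extractor = PerStoreFeature(
        buffers_per_store=5,
        arith_intensity_curve_num_samples=10,
        cache_line_bytes=64,
        extract_workload=False,
    )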
diff --git a/docs/reference/api/doxygen/feature__extractor_8h_source.html b/docs/reference/api/doxygen/feature__extractor_8h_source.html
index f9d1d138b..154d29d4e 100644
--- a/docs/reference/api/doxygen/feature__extractor_8h_source.html
+++ b/docs/reference/api/doxygen/feature__extractor_8h_source.html
@@ -66,7 +66,7 @@ $(function() {
 <div class="title">feature_extractor.h</div>  </div>
 </div><!--header-->
 <div class="contents">
-<a href="feature__extractor_8h.html">Go to the documentation of this file.</a><div class="fragment"><div class="line"><a name="l00001"></a><span class="lineno">    1</span>&#160;<span class="comment">/*</span></div><div class="line"><a name="l00002"></a><span class="lineno">    2</span>&#160;<span class="comment"> * Licensed to the Apache Software Foundation (ASF) under one</span></div><div class="line"><a name="l00003"></a><span class="lineno">    3</span>&#160;<span class="comment"> *  [...]
+<a href="feature__extractor_8h.html">Go to the documentation of this file.</a><div class="fragment"><div class="line"><a name="l00001"></a><span class="lineno">    1</span>&#160;<span class="comment">/*</span></div><div class="line"><a name="l00002"></a><span class="lineno">    2</span>&#160;<span class="comment"> * Licensed to the Apache Software Foundation (ASF) under one</span></div><div class="line"><a name="l00003"></a><span class="lineno">    3</span>&#160;<span class="comment"> *  [...]
 <div class="ttc" id="string_8h_html"><div class="ttname"><a href="string_8h.html">string.h</a></div><div class="ttdoc">Runtime String container types. </div></div>
 <div class="ttc" id="classtvm_1_1meta__schedule_1_1PyFeatureExtractorNode_html_aac552d70d771cb4ad079a59d09d0db91"><div class="ttname"><a href="classtvm_1_1meta__schedule_1_1PyFeatureExtractorNode.html#aac552d70d771cb4ad079a59d09d0db91">tvm::meta_schedule::PyFeatureExtractorNode::VisitAttrs</a></div><div class="ttdeci">void VisitAttrs(tvm::AttrVisitor *v)</div><div class="ttdef"><b>Definition:</b> feature_extractor.h:79</div></div>
 <div class="ttc" id="classtvm_1_1meta__schedule_1_1FeatureExtractorNode_html_a93af14b34e4b9d6cba347496440df79b"><div class="ttname"><a href="classtvm_1_1meta__schedule_1_1FeatureExtractorNode.html#a93af14b34e4b9d6cba347496440df79b">tvm::meta_schedule::FeatureExtractorNode::_type_key</a></div><div class="ttdeci">static constexpr const char * _type_key</div><div class="ttdef"><b>Definition:</b> feature_extractor.h:53</div></div>
diff --git a/docs/reference/api/doxygen/functions_func_p.html b/docs/reference/api/doxygen/functions_func_p.html
index 3b0384163..b889c8fb8 100644
--- a/docs/reference/api/doxygen/functions_func_p.html
+++ b/docs/reference/api/doxygen/functions_func_p.html
@@ -123,7 +123,7 @@ $(function() {
 : <a class="el" href="classtvm_1_1runtime_1_1profiling_1_1PercentNode.html#a45d51732fbde990710ac13c294225e39">tvm::runtime::profiling::PercentNode</a>
 </li>
 <li>PerStoreFeature()
-: <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a70128f6b43830ee2a9a421efec900ba1">tvm::meta_schedule::FeatureExtractor</a>
+: <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#abbfc563425a975e026f2daf3bbfa86ee">tvm::meta_schedule::FeatureExtractor</a>
 </li>
 <li>PlaceholderOp()
 : <a class="el" href="classtvm_1_1te_1_1PlaceholderOp.html#ae6cedf336ddb311242a1c0b0bb91741a">tvm::te::PlaceholderOp</a>
diff --git a/docs/reference/api/doxygen/functions_p.html b/docs/reference/api/doxygen/functions_p.html
index e5f260d09..c398bcf1b 100644
--- a/docs/reference/api/doxygen/functions_p.html
+++ b/docs/reference/api/doxygen/functions_p.html
@@ -231,7 +231,7 @@ $(function() {
 : <a class="el" href="classtvm_1_1runtime_1_1profiling_1_1PercentNode.html#a45d51732fbde990710ac13c294225e39">tvm::runtime::profiling::PercentNode</a>
 </li>
 <li>PerStoreFeature()
-: <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a70128f6b43830ee2a9a421efec900ba1">tvm::meta_schedule::FeatureExtractor</a>
+: <a class="el" href="classtvm_1_1meta__schedule_1_1FeatureExtractor.html#abbfc563425a975e026f2daf3bbfa86ee">tvm::meta_schedule::FeatureExtractor</a>
 </li>
 <li>PlaceholderOp()
 : <a class="el" href="classtvm_1_1te_1_1PlaceholderOp.html#ae6cedf336ddb311242a1c0b0bb91741a">tvm::te::PlaceholderOp</a>
diff --git a/docs/reference/api/doxygen/search/all_11.js b/docs/reference/api/doxygen/search/all_11.js
index a3b81745b..b637e5b40 100644
--- a/docs/reference/api/doxygen/search/all_11.js
+++ b/docs/reference/api/doxygen/search/all_11.js
@@ -92,7 +92,7 @@ var searchData=
   ['peek',['Peek',['../classtvm_1_1runtime_1_1micro__rpc_1_1FrameBuffer.html#aefdbe684e811791635e77b026b2ca11c',1,'tvm::runtime::micro_rpc::FrameBuffer']]],
   ['percent',['percent',['../classtvm_1_1runtime_1_1profiling_1_1PercentNode.html#a6852f14d052d8b23ad4058d149ec2a46',1,'tvm::runtime::profiling::PercentNode']]],
   ['percentnode',['PercentNode',['../classtvm_1_1runtime_1_1profiling_1_1PercentNode.html',1,'tvm::runtime::profiling::PercentNode'],['../classtvm_1_1runtime_1_1profiling_1_1PercentNode.html#a45d51732fbde990710ac13c294225e39',1,'tvm::runtime::profiling::PercentNode::PercentNode()']]],
-  ['perstorefeature',['PerStoreFeature',['../classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a70128f6b43830ee2a9a421efec900ba1',1,'tvm::meta_schedule::FeatureExtractor']]],
+  ['perstorefeature',['PerStoreFeature',['../classtvm_1_1meta__schedule_1_1FeatureExtractor.html#abbfc563425a975e026f2daf3bbfa86ee',1,'tvm::meta_schedule::FeatureExtractor']]],
   ['pipeline_5fexec_5fscope',['pipeline_exec_scope',['../namespacetvm_1_1tir_1_1attr.html#aee14d4d24b86179fd19938a02bc15512',1,'tvm::tir::attr']]],
   ['pipeline_5fstage_5fscope',['pipeline_stage_scope',['../namespacetvm_1_1tir_1_1attr.html#a19ecbf068afc115a2282e533c0fe518d',1,'tvm::tir::attr']]],
   ['placeholder',['placeholder',['../namespacetvm_1_1te.html#a15a1cc6f7146730ec1f03210c81a8a3c',1,'tvm::te']]],
diff --git a/docs/reference/api/doxygen/search/functions_10.js b/docs/reference/api/doxygen/search/functions_10.js
index 901208dbf..2cc60355b 100644
--- a/docs/reference/api/doxygen/search/functions_10.js
+++ b/docs/reference/api/doxygen/search/functions_10.js
@@ -34,7 +34,7 @@ var searchData=
   ['patternwildcard',['PatternWildcard',['../classtvm_1_1relay_1_1PatternWildcard.html#aab7d1690088beab9987f97cdebd64c0c',1,'tvm::relay::PatternWildcard::PatternWildcard()'],['../classtvm_1_1relay_1_1PatternWildcard.html#a53a536533ee2c7ae4f0fcb649fc967c3',1,'tvm::relay::PatternWildcard::PatternWildcard(ObjectPtr&lt; Object &gt; n)'],['../classtvm_1_1relay_1_1PatternWildcard.html#aea56c9cc7113d61aee41cd6569aef9d5',1,'tvm::relay::PatternWildcard::PatternWildcard(const PatternWildcard &amp; [...]
   ['peek',['Peek',['../classtvm_1_1runtime_1_1micro__rpc_1_1FrameBuffer.html#aefdbe684e811791635e77b026b2ca11c',1,'tvm::runtime::micro_rpc::FrameBuffer']]],
   ['percentnode',['PercentNode',['../classtvm_1_1runtime_1_1profiling_1_1PercentNode.html#a45d51732fbde990710ac13c294225e39',1,'tvm::runtime::profiling::PercentNode']]],
-  ['perstorefeature',['PerStoreFeature',['../classtvm_1_1meta__schedule_1_1FeatureExtractor.html#a70128f6b43830ee2a9a421efec900ba1',1,'tvm::meta_schedule::FeatureExtractor']]],
+  ['perstorefeature',['PerStoreFeature',['../classtvm_1_1meta__schedule_1_1FeatureExtractor.html#abbfc563425a975e026f2daf3bbfa86ee',1,'tvm::meta_schedule::FeatureExtractor']]],
   ['placeholder',['placeholder',['../namespacetvm_1_1te.html#a15a1cc6f7146730ec1f03210c81a8a3c',1,'tvm::te']]],
   ['placeholderop',['PlaceholderOp',['../classtvm_1_1te_1_1PlaceholderOp.html#ae6cedf336ddb311242a1c0b0bb91741a',1,'tvm::te::PlaceholderOp']]],
   ['planandupdatebufferallocationlocation',['PlanAndUpdateBufferAllocationLocation',['../namespacetvm_1_1tir_1_1transform.html#a5ffa51908f8a4c9f7eb4321d8b92c234',1,'tvm::tir::transform']]],
diff --git a/docs/reference/api/python/auto_scheduler.html b/docs/reference/api/python/auto_scheduler.html
index e42e18736..befcca780 100644
--- a/docs/reference/api/python/auto_scheduler.html
+++ b/docs/reference/api/python/auto_scheduler.html
@@ -1597,7 +1597,7 @@ history states as starting point to perform Evolutionary Search).</p></li>
 
 <dl class="py class">
 <dt class="sig sig-object py" id="tvm.auto_scheduler.SketchPolicy">
-<em class="property"><span class="pre">class</span> </em><span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">SketchPolicy</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">program_cost_model</span></span><span class="o"><span class="pre">=</span></span><span class="defau [...]
+<em class="property"><span class="pre">class</span> </em><span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">SketchPolicy</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">program_cost_model</span></span><span class="o"><span class="pre">=</span></span><span class="defau [...]
 <dd><p>The search policy that searches in a hierarchical search space defined by sketches.
 The policy randomly samples programs from the space defined by sketches and uses evolutionary
 search to fine-tune them.</p>
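A minimal sketch of driving this policy, assuming task is an auto_scheduler.SearchTask defined elsewhere and an XGBoost cost model:

    from tvm import auto_scheduler

    policy = auto_scheduler.SketchPolicy(
        task, program_cost_model=auto_scheduler.XGBModel()
    )
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=64,
        measure_callbacks=[auto_scheduler.RecordToFile("records.json")],
    )
    task.tune(tune_option, search_policy=policy)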
@@ -1881,7 +1881,7 @@ Candidates:
 
 <dl class="py function">
 <dt class="sig sig-object py" id="tvm.auto_scheduler.auto_schedule">
-<span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">auto_schedule</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">search_policy</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em clas [...]
+<span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">auto_schedule</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">search_policy</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em clas [...]
 <dd><p>THIS API IS DEPRECATED.</p>
 <p>Run auto scheduling search for a task.</p>
 <dl class="field-list simple">
diff --git a/docs/reference/api/python/topi.html b/docs/reference/api/python/topi.html
index bafed2b8b..9101638b1 100644
--- a/docs/reference/api/python/topi.html
+++ b/docs/reference/api/python/topi.html
@@ -6396,10 +6396,10 @@ a tuple of (left, right) or a string in [‘VALID’, ‘SAME’].</p></li>
 <dd><p>Perform log softmax activation on the data</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>data</strong> (<a class="reference internal" href="te.html#tvm.te.Tensor" title="tvm.te.Tensor"><em>tvm.te.Tensor</em></a>) – 2-D input data</p>
+<dd class="field-odd"><p><strong>data</strong> (<a class="reference internal" href="te.html#tvm.te.Tensor" title="tvm.te.Tensor"><em>tvm.te.Tensor</em></a>) – N-D input data</p>
 </dd>
 <dt class="field-even">Returns</dt>
-<dd class="field-even"><p><strong>output</strong> – 2-D output with same shape</p>
+<dd class="field-even"><p><strong>output</strong> – N-D output with same shape</p>
 </dd>
 <dt class="field-odd">Return type</dt>
 <dd class="field-odd"><p><a class="reference internal" href="te.html#tvm.te.Tensor" title="tvm.te.Tensor">tvm.te.Tensor</a></p>
diff --git a/docs/reference/api/typedoc/classes/bytestreamreader.html b/docs/reference/api/typedoc/classes/bytestreamreader.html
index fe93f8016..49c492995 100644
--- a/docs/reference/api/typedoc/classes/bytestreamreader.html
+++ b/docs/reference/api/typedoc/classes/bytestreamreader.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -141,7 +141,7 @@
 					<div class="tsd-signature tsd-kind-icon">bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Uint8Array</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -151,7 +151,7 @@
 					<div class="tsd-signature tsd-kind-icon">offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -168,7 +168,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">Uint8Array</span></h4>
@@ -185,7 +185,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -202,7 +202,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/reference/api/typedoc/classes/cachedcallstack.html b/docs/reference/api/typedoc/classes/cachedcallstack.html
index 16aca94ac..d17b3fdf1 100644
--- a/docs/reference/api/typedoc/classes/cachedcallstack.html
+++ b/docs/reference/api/typedoc/classes/cachedcallstack.html
@@ -144,7 +144,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L223">memory.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L223">memory.ts:223</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -172,7 +172,7 @@
 					<div class="tsd-signature tsd-kind-icon">temp<wbr>Args<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><a href="../interfaces/disposable.html" class="tsd-signature-type">Disposable</a><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L208">memory.ts:208</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L208">memory.ts:208</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -194,7 +194,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L312">memory.ts:312</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L312">memory.ts:312</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L284">memory.ts:284</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L284">memory.ts:284</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -262,7 +262,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L388">memory.ts:388</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L388">memory.ts:388</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -300,7 +300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L376">memory.ts:376</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L376">memory.ts:376</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -340,7 +340,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L267">memory.ts:267</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L267">memory.ts:267</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -373,7 +373,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L243">memory.ts:243</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L243">memory.ts:243</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -390,7 +390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L321">memory.ts:321</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L321">memory.ts:321</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -422,7 +422,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L252">memory.ts:252</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L252">memory.ts:252</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -444,7 +444,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L359">memory.ts:359</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L359">memory.ts:359</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -470,7 +470,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L342">memory.ts:342</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L342">memory.ts:342</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -496,7 +496,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L350">memory.ts:350</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L350">memory.ts:350</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -522,7 +522,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L326">memory.ts:326</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L326">memory.ts:326</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -548,7 +548,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L363">memory.ts:363</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L363">memory.ts:363</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -574,7 +574,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L346">memory.ts:346</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L346">memory.ts:346</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -600,7 +600,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L334">memory.ts:334</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L334">memory.ts:334</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
diff --git a/docs/reference/api/typedoc/classes/dldatatype.html b/docs/reference/api/typedoc/classes/dldatatype.html
index a60f7ebc2..ee7bf1110 100644
--- a/docs/reference/api/typedoc/classes/dldatatype.html
+++ b/docs/reference/api/typedoc/classes/dldatatype.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L262">runtime.ts:262</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L262">runtime.ts:262</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">bits<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L260">runtime.ts:260</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L260">runtime.ts:260</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">code<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L258">runtime.ts:258</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L258">runtime.ts:258</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -177,7 +177,7 @@
 					<div class="tsd-signature tsd-kind-icon">lanes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L262">runtime.ts:262</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L262">runtime.ts:262</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -199,7 +199,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L279">runtime.ts:279</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L279">runtime.ts:279</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -216,7 +216,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L270">runtime.ts:270</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L270">runtime.ts:270</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/reference/api/typedoc/classes/dldevice.html b/docs/reference/api/typedoc/classes/dldevice.html
index f495a292f..45b89e857 100644
--- a/docs/reference/api/typedoc/classes/dldevice.html
+++ b/docs/reference/api/typedoc/classes/dldevice.html
@@ -118,7 +118,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L202">runtime.ts:202</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L202">runtime.ts:202</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L200">runtime.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L200">runtime.ts:200</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -161,7 +161,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L198">runtime.ts:198</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L198">runtime.ts:198</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -183,7 +183,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L223">runtime.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L223">runtime.ts:223</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -205,7 +205,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L230">runtime.ts:230</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L230">runtime.ts:230</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/reference/api/typedoc/classes/environment.html b/docs/reference/api/typedoc/classes/environment.html
index 3a1eed917..dea0c3e24 100644
--- a/docs/reference/api/typedoc/classes/environment.html
+++ b/docs/reference/api/typedoc/classes/environment.html
@@ -125,7 +125,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L86">environment.ts:86</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L86">environment.ts:86</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -169,7 +169,7 @@
 					<aside class="tsd-sources">
 						<p>Implementation of <a href="../interfaces/libraryprovider.html">LibraryProvider</a>.<a href="../interfaces/libraryprovider.html#imports">imports</a></p>
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L70">environment.ts:70</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L70">environment.ts:70</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L69">environment.ts:69</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L69">environment.ts:69</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -210,7 +210,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">ctypes.FTVMWasmPackedCFunc</span><span class="tsd-signature-symbol"> | </span><span class="tsd-signature-type">undefined</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = [undefined,]</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L78">environment.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L78">environment.ts:78</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -228,7 +228,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<wbr>Free<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L84">environment.ts:84</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L84">environment.ts:84</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -250,7 +250,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L105">environment.ts:105</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L105">environment.ts:105</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/ffilibrary.html b/docs/reference/api/typedoc/classes/ffilibrary.html
index ca7e9e261..1eccae144 100644
--- a/docs/reference/api/typedoc/classes/ffilibrary.html
+++ b/docs/reference/api/typedoc/classes/ffilibrary.html
@@ -131,7 +131,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L49">runtime.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L49">runtime.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L46">runtime.ts:46</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L46">runtime.ts:46</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L45">runtime.ts:45</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L45">runtime.ts:45</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L44">runtime.ts:44</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L44">runtime.ts:44</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">webGPUContext<span class="tsd-signature-symbol">:</span> <a href="webgpucontext.html" class="tsd-signature-type">WebGPUContext</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L47">runtime.ts:47</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L47">runtime.ts:47</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -203,7 +203,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L76">runtime.ts:76</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L76">runtime.ts:76</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L66">runtime.ts:66</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L66">runtime.ts:66</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -243,7 +243,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L84">runtime.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L84">runtime.ts:84</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <a href="cachedcallstack.html" class="tsd-signature-type">CachedCallStack</a></h4>
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L95">runtime.ts:95</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L95">runtime.ts:95</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -283,7 +283,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L72">runtime.ts:72</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L72">runtime.ts:72</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/reference/api/typedoc/classes/graphexecutor.html b/docs/reference/api/typedoc/classes/graphexecutor.html
index 47a7effd8..4523b90f6 100644
--- a/docs/reference/api/typedoc/classes/graphexecutor.html
+++ b/docs/reference/api/typedoc/classes/graphexecutor.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L583">runtime.ts:583</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L583">runtime.ts:583</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">module<span class="tsd-signature-symbol">:</span> <a href="module.html" class="tsd-signature-type">Module</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L579">runtime.ts:579</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L579">runtime.ts:579</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L654">runtime.ts:654</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L654">runtime.ts:654</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -224,7 +224,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L597">runtime.ts:597</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L597">runtime.ts:597</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -241,7 +241,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L631">runtime.ts:631</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L631">runtime.ts:631</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L644">runtime.ts:644</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L644">runtime.ts:644</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -310,7 +310,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L621">runtime.ts:621</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L621">runtime.ts:621</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -332,7 +332,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L609">runtime.ts:609</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L609">runtime.ts:609</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/instance.html b/docs/reference/api/typedoc/classes/instance.html
index c164245ef..a3f955097 100644
--- a/docs/reference/api/typedoc/classes/instance.html
+++ b/docs/reference/api/typedoc/classes/instance.html
@@ -139,7 +139,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L692">runtime.ts:692</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L692">runtime.ts:692</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -202,7 +202,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L684">runtime.ts:684</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L684">runtime.ts:684</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -212,7 +212,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L683">runtime.ts:683</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L683">runtime.ts:683</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -229,7 +229,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L932">runtime.ts:932</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L932">runtime.ts:932</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L994">runtime.ts:994</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L994">runtime.ts:994</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -303,7 +303,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L924">runtime.ts:924</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L924">runtime.ts:924</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -341,7 +341,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L732">runtime.ts:732</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L732">runtime.ts:732</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -358,7 +358,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L952">runtime.ts:952</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L952">runtime.ts:952</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -402,7 +402,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L816">runtime.ts:816</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L816">runtime.ts:816</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -434,7 +434,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L1033">runtime.ts:1033</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L1033">runtime.ts:1033</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -465,7 +465,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L846">runtime.ts:846</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L846">runtime.ts:846</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -497,7 +497,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L750">runtime.ts:750</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L750">runtime.ts:750</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -520,7 +520,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L1013">runtime.ts:1013</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L1013">runtime.ts:1013</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -568,7 +568,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L789">runtime.ts:789</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L789">runtime.ts:789</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -608,7 +608,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L914">runtime.ts:914</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L914">runtime.ts:914</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -646,7 +646,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L1140">runtime.ts:1140</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L1140">runtime.ts:1140</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -698,7 +698,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L740">runtime.ts:740</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L740">runtime.ts:740</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -722,7 +722,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L868">runtime.ts:868</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L868">runtime.ts:868</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -754,7 +754,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L857">runtime.ts:857</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L857">runtime.ts:857</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -786,7 +786,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L940">runtime.ts:940</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L940">runtime.ts:940</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/memory.html b/docs/reference/api/typedoc/classes/memory.html
index d80387c72..afef1e61b 100644
--- a/docs/reference/api/typedoc/classes/memory.html
+++ b/docs/reference/api/typedoc/classes/memory.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L40">memory.ts:40</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L40">memory.ts:40</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Memory</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L32">memory.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L32">memory.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span><span class="tsd-signature-symbol"> = true</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L33">memory.ts:33</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L33">memory.ts:33</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L154">memory.ts:154</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L154">memory.ts:154</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -210,7 +210,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L90">memory.ts:90</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L90">memory.ts:90</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -233,7 +233,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L97">memory.ts:97</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L97">memory.ts:97</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -256,7 +256,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L74">memory.ts:74</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L74">memory.ts:74</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L81">memory.ts:81</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L81">memory.ts:81</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -302,7 +302,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L104">memory.ts:104</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L104">memory.ts:104</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -325,7 +325,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L132">memory.ts:132</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L132">memory.ts:132</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -362,7 +362,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L145">memory.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L145">memory.ts:145</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -393,7 +393,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L60">memory.ts:60</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L60">memory.ts:60</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -416,7 +416,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L67">memory.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L67">memory.ts:67</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -439,7 +439,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L53">memory.ts:53</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L53">memory.ts:53</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -462,7 +462,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L114">memory.ts:114</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L114">memory.ts:114</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -485,7 +485,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L124">memory.ts:124</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L124">memory.ts:124</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -502,7 +502,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/memory.ts#L175">memory.ts:175</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/memory.ts#L175">memory.ts:175</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/module.html b/docs/reference/api/typedoc/classes/module.html
index 5e14e2075..1b94fc3aa 100644
--- a/docs/reference/api/typedoc/classes/module.html
+++ b/docs/reference/api/typedoc/classes/module.html
@@ -124,7 +124,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L504">runtime.ts:504</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L504">runtime.ts:504</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L502">runtime.ts:502</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L502">runtime.ts:502</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -187,7 +187,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L516">runtime.ts:516</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L516">runtime.ts:516</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -204,7 +204,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L530">runtime.ts:530</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L530">runtime.ts:530</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -236,7 +236,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L561">runtime.ts:561</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L561">runtime.ts:561</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/ndarray.html b/docs/reference/api/typedoc/classes/ndarray.html
index abf541b25..4018c4144 100644
--- a/docs/reference/api/typedoc/classes/ndarray.html
+++ b/docs/reference/api/typedoc/classes/ndarray.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L304">runtime.ts:304</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L304">runtime.ts:304</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -158,7 +158,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <a href="dldevice.html" class="tsd-signature-type">DLDevice</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L297">runtime.ts:297</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L297">runtime.ts:297</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -173,7 +173,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L293">runtime.ts:293</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L293">runtime.ts:293</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -188,7 +188,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L289">runtime.ts:289</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L289">runtime.ts:289</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -203,7 +203,7 @@
 					<div class="tsd-signature tsd-kind-icon">ndim<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L291">runtime.ts:291</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L291">runtime.ts:291</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -218,7 +218,7 @@
 					<div class="tsd-signature tsd-kind-icon">shape<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L295">runtime.ts:295</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L295">runtime.ts:295</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -240,7 +240,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L370">runtime.ts:370</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L370">runtime.ts:370</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -273,7 +273,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L414">runtime.ts:414</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L414">runtime.ts:414</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -305,7 +305,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L355">runtime.ts:355</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L355">runtime.ts:355</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -322,7 +322,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L474">runtime.ts:474</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L474">runtime.ts:474</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -346,7 +346,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L443">runtime.ts:443</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L443">runtime.ts:443</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/packedfunccell.html b/docs/reference/api/typedoc/classes/packedfunccell.html
index bbfe6f736..f2a1b3c8b 100644
--- a/docs/reference/api/typedoc/classes/packedfunccell.html
+++ b/docs/reference/api/typedoc/classes/packedfunccell.html
@@ -122,7 +122,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L158">runtime.ts:158</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L158">runtime.ts:158</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L157">runtime.ts:157</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L157">runtime.ts:157</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -164,7 +164,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L165">runtime.ts:165</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L165">runtime.ts:165</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
diff --git a/docs/reference/api/typedoc/classes/rpcserver.html b/docs/reference/api/typedoc/classes/rpcserver.html
index bbbeeaf93..ff6908510 100644
--- a/docs/reference/api/typedoc/classes/rpcserver.html
+++ b/docs/reference/api/typedoc/classes/rpcserver.html
@@ -115,7 +115,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">get<wbr>Imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">unknown</span><span class="tsd-signat [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -201,7 +201,7 @@
 					<div class="tsd-signature tsd-kind-icon">key<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -211,7 +211,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -242,7 +242,7 @@
 					<div class="tsd-signature tsd-kind-icon">socket<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">WebSocket</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -252,7 +252,7 @@
 					<div class="tsd-signature tsd-kind-icon">state<span class="tsd-signature-symbol">:</span> <a href="../enums/rpcserverstate.html" class="tsd-signature-type">RPCServerState</a><span class="tsd-signature-symbol"> = RPCServerState.InitHeader</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -262,7 +262,7 @@
 					<div class="tsd-signature tsd-kind-icon">url<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
 						</ul>
 					</aside>
 				</section>
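[editor's note] Read together, the seven rpcserver.html hunks above enumerate the RPCServer members at rpc_server.ts:77-92, so the class surface can be reconstructed directly from them. In the sketch below, the field names, types, and the default state come straight from the signature lines; the constructor parameter list and all initialization logic are assumptions added for runnability, and the RPCServerState stub stands in for the full enum that appears in the rpcserverstate.html hunks further down.

    // Sketch reconstructed from the "Defined in" links above
    // (web/src/rpc_server.ts L77-L92); constructor and bodies are
    // illustrative assumptions.
    enum RPCServerState { InitHeader } // stub; full enum appears below

    class RPCServer {
      url: string;                                       // rpc_server.ts:77
      key: string;                                       // rpc_server.ts:78
      socket: WebSocket;                                 // rpc_server.ts:79
      state: RPCServerState = RPCServerState.InitHeader; // rpc_server.ts:80
      logger: (msg: string) => void;                     // rpc_server.ts:81
      getImports: () => Record<string, unknown>;         // rpc_server.ts:82

      // The real constructor is documented at rpc_server.ts:92; this
      // parameter list is a hypothetical reconstruction.
      constructor(
        url: string,
        key: string,
        getImports: () => Record<string, unknown>,
        logger: (msg: string) => void = console.log
      ) {
        this.url = url;
        this.key = key;
        this.getImports = getImports;
        this.logger = logger;
        this.socket = new WebSocket(url); // assumed: connect on construction
      }
    }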
diff --git a/docs/reference/api/typedoc/classes/scalar.html b/docs/reference/api/typedoc/classes/scalar.html
index c0aee368a..4ed26574d 100644
--- a/docs/reference/api/typedoc/classes/scalar.html
+++ b/docs/reference/api/typedoc/classes/scalar.html
@@ -112,7 +112,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -137,7 +137,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">value<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L143">runtime.ts:143</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L143">runtime.ts:143</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
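[editor's note] The scalar.html hunks point at runtime.ts:143-145, which between them describe the entire Scalar wrapper: a numeric value plus a dtype string, with the constructor on the same line as dtype. A self-contained sketch, with the constructor argument order assumed rather than confirmed by the hunks:

    // Sketch only: Scalar as described by the links above (runtime.ts:143-145).
    class Scalar {
      value: number; // the wrapped number, documented at runtime.ts:143
      dtype: string; // data-type string such as "float32", runtime.ts:145

      constructor(value: number, dtype: string) { // documented at runtime.ts:145
        this.value = value;
        this.dtype = dtype;
      }
    }

    // Hypothetical usage: a typed scalar argument for a packed function.
    const eps = new Scalar(1e-5, "float32");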
diff --git a/docs/reference/api/typedoc/classes/webgpucontext.html b/docs/reference/api/typedoc/classes/webgpucontext.html
index 6673e01ed..3bc767817 100644
--- a/docs/reference/api/typedoc/classes/webgpucontext.html
+++ b/docs/reference/api/typedoc/classes/webgpucontext.html
@@ -120,7 +120,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -145,7 +145,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">GPUDevice</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -155,7 +155,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -172,7 +172,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -209,7 +209,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
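[editor's note] Only two of the webgpucontext.html hunks name concrete members: device (webgpu.ts:50) and memory (webgpu.ts:51); the remaining links (webgpu.ts:57, 67, 84, 170) cover the constructor and methods whose names are not visible in this diff. A minimal field sketch, with placeholder aliases standing in for the real GPUDevice (WebGPU type definitions) and Memory (the runtime's memory wrapper) declarations:

    // Sketch only: the two WebGPUContext fields visible in the hunks above.
    type GPUDevice = unknown; // placeholder for the WebGPU type definitions
    type Memory = unknown;    // placeholder for the tvmjs Memory wrapper

    interface WebGPUContextShape {
      device: GPUDevice; // documented at webgpu.ts:50
      memory: Memory;    // documented at webgpu.ts:51
    }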
diff --git a/docs/reference/api/typedoc/enums/argtypecode.html b/docs/reference/api/typedoc/enums/argtypecode.html
index f76787b84..22f694f0a 100644
--- a/docs/reference/api/typedoc/enums/argtypecode.html
+++ b/docs/reference/api/typedoc/enums/argtypecode.html
@@ -106,7 +106,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLDevice<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 6</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -116,7 +116,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -126,7 +126,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -136,7 +136,7 @@
 					<div class="tsd-signature tsd-kind-icon">Null<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 12</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMDLTensor<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 7</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMModule<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 9</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMNDArray<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 13</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -196,7 +196,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObject<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -206,7 +206,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObjectRValue<wbr>Ref<wbr>Arg<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 14</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -216,7 +216,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMOpaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -226,7 +226,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMPacked<wbr>Func<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 10</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -236,7 +236,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 11</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -246,7 +246,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
 						</ul>
 					</aside>
 				</section>
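[editor's note] The argtypecode.html hunks carry the enum's full value table: every member and its numeric code appears in a signature line above. Reassembled in numeric order, the enum the page documents (ctypes.ts:214-228) reads as follows; only the ordering and the TypeScript syntax are editorial, the names and values come straight from the hunks.

    // Reassembled from the signature lines above (ctypes.ts:214-228).
    enum ArgTypeCode {
      Int = 0,                    // ctypes.ts:214
      UInt = 1,                   // ctypes.ts:215
      Float = 2,                  // ctypes.ts:216
      TVMOpaqueHandle = 3,        // ctypes.ts:217
      Null = 4,                   // ctypes.ts:218
      TVMDataType = 5,            // ctypes.ts:219
      DLDevice = 6,               // ctypes.ts:220
      TVMDLTensorHandle = 7,      // ctypes.ts:221
      TVMObjectHandle = 8,        // ctypes.ts:222
      TVMModuleHandle = 9,        // ctypes.ts:223
      TVMPackedFuncHandle = 10,   // ctypes.ts:224
      TVMStr = 11,                // ctypes.ts:225
      TVMBytes = 12,              // ctypes.ts:226
      TVMNDArrayHandle = 13,      // ctypes.ts:227
      TVMObjectRValueRefArg = 14, // ctypes.ts:228
    }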
diff --git a/docs/reference/api/typedoc/enums/aynccallbackcode.html b/docs/reference/api/typedoc/enums/aynccallbackcode.html
index 9ab0193f3..41afeef73 100644
--- a/docs/reference/api/typedoc/enums/aynccallbackcode.html
+++ b/docs/reference/api/typedoc/enums/aynccallbackcode.html
@@ -93,7 +93,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Exception<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L676">runtime.ts:676</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L676">runtime.ts:676</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -103,7 +103,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L675">runtime.ts:675</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L675">runtime.ts:675</a></li>
 						</ul>
 					</aside>
 				</section>
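[editor's note] aynccallbackcode.html (the missing "s" in "Aync" is in the generated filename itself, so the diff reproduces it faithfully) documents just two members, both visible above. Reassembled:

    // Reassembled from the two signature lines above (runtime.ts:675-676).
    enum AyncCallbackCode {
      kReturn = 4,    // runtime.ts:675
      kException = 5, // runtime.ts:676
    }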
diff --git a/docs/reference/api/typedoc/enums/dldatatypecode.html b/docs/reference/api/typedoc/enums/dldatatypecode.html
index fba51cf36..d09103143 100644
--- a/docs/reference/api/typedoc/enums/dldatatypecode.html
+++ b/docs/reference/api/typedoc/enums/dldatatypecode.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L242">runtime.ts:242</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L242">runtime.ts:242</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L240">runtime.ts:240</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L240">runtime.ts:240</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">Opaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L243">runtime.ts:243</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L243">runtime.ts:243</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -125,7 +125,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L241">runtime.ts:241</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L241">runtime.ts:241</a></li>
 						</ul>
 					</aside>
 				</section>
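[editor's note] The dldatatypecode.html hunks likewise carry the full table. The codes follow DLPack's type-code convention, and they line up with the DLDataTypeCodeToStr mapping that appears later in this diff (runtime.ts:247-250), noted per member below:

    // Reassembled from the signature lines above (runtime.ts:240-243).
    enum DLDataTypeCode {
      Int = 0,          // runtime.ts:240  -> "int"
      UInt = 1,         // runtime.ts:241  -> "uint"
      Float = 2,        // runtime.ts:242  -> "float"
      OpaqueHandle = 3, // runtime.ts:243  -> "handle"
    }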
diff --git a/docs/reference/api/typedoc/enums/rpcserverstate.html b/docs/reference/api/typedoc/enums/rpcserverstate.html
index c86a65127..803f4c655 100644
--- a/docs/reference/api/typedoc/enums/rpcserverstate.html
+++ b/docs/reference/api/typedoc/enums/rpcserverstate.html
@@ -90,7 +90,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<wbr>Key<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Server<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Body<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">Wait<wbr>For<wbr>Callback<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
 						</ul>
 					</aside>
 				</section>
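[editor's note] rpcserverstate.html lists its members alphabetically, but the source lines in the links (rpc_server.ts:27-32) reveal the declaration order, which also matches the protocol's progression from handshake to packet processing:

    // Reassembled in source order from the links above (rpc_server.ts:27-32).
    enum RPCServerState {
      InitHeader,          // rpc_server.ts:27
      InitHeaderKey,       // rpc_server.ts:28
      InitServer,          // rpc_server.ts:29
      WaitForCallback,     // rpc_server.ts:30
      ReceivePacketHeader, // rpc_server.ts:31
      ReceivePacketBody,   // rpc_server.ts:32
    }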
diff --git a/docs/reference/api/typedoc/enums/sizeof.html b/docs/reference/api/typedoc/enums/sizeof.html
index b65dad34e..861229017 100644
--- a/docs/reference/api/typedoc/enums/sizeof.html
+++ b/docs/reference/api/typedoc/enums/sizeof.html
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLDevice<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32 + I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">F32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">F64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">I32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -150,7 +150,7 @@
 					<div class="tsd-signature tsd-kind-icon">I64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -160,7 +160,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMValue<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">U16<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -180,7 +180,7 @@
 					<div class="tsd-signature tsd-kind-icon">U8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
 						</ul>
 					</aside>
 				</section>
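[editor's note] The sizeof.html hunks spell out the byte widths the WASM FFI layer presumably assumes when packing arguments; note the two derived members, which the generated signatures show as expressions rather than literals:

    // Reassembled from the signature lines above (ctypes.ts:199-207).
    enum SizeOf {
      U8 = 1,               // ctypes.ts:199
      U16 = 2,              // ctypes.ts:200
      I32 = 4,              // ctypes.ts:201
      I64 = 8,              // ctypes.ts:202
      F32 = 4,              // ctypes.ts:203
      F64 = 8,              // ctypes.ts:204
      TVMValue = 8,         // ctypes.ts:205
      DLDataType = I32,     // ctypes.ts:206 (= 4)
      DLDevice = I32 + I32, // ctypes.ts:207 (= 8: device_type + device_id)
    }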
diff --git a/docs/reference/api/typedoc/index.html b/docs/reference/api/typedoc/index.html
index 0049bf817..f7d306034 100644
--- a/docs/reference/api/typedoc/index.html
+++ b/docs/reference/api/typedoc/index.html
@@ -174,7 +174,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Alloc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>shape<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, ndim<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeCode<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeBits<span class="tsd [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>Bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">num [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -282,7 +282,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>To<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>from<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, to<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-sig [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -326,7 +326,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>ToBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</sp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -370,7 +370,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -406,7 +406,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMBackend<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number< [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -458,7 +458,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCFunc<wbr>Set<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ret<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -506,7 +506,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCb<wbr>Arg<wbr>ToReturn<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, code<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span c [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -545,7 +545,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Call<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-t [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -601,7 +601,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -637,7 +637,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Get<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span cla [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -676,7 +676,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>List<wbr>Global<wbr>Names<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>outSize<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, outArray<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -715,7 +715,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Register<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, f<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, override<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</spa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -758,7 +758,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMGet<wbr>Last<wbr>Error<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -788,7 +788,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -824,7 +824,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Get<wbr>Function<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, funcName<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, queryImports<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">numbe [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -872,7 +872,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Import<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, dep<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-si [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -912,7 +912,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMSynchronize<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>deviceType<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, deviceId<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signatur [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -954,7 +954,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Alloc<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>size<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -990,7 +990,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Free<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ptr<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1026,7 +1026,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Func<wbr>Create<wbr>FromCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resource<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1066,7 +1066,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>args<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1118,7 +1118,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<wbr>Finalizer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resourceHandle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1154,7 +1154,7 @@
 					<div class="tsd-signature tsd-kind-icon">GPUPointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1169,7 +1169,7 @@
 					<div class="tsd-signature tsd-kind-icon">Packed<wbr>Func<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">...</span>args<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol"> &amp; </span><a href="interfaces/disp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L36">runtime.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L36">runtime.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1184,7 +1184,7 @@
 					<div class="tsd-signature tsd-kind-icon">Pointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1199,7 +1199,7 @@
 					<div class="tsd-signature tsd-kind-icon">Ptr<wbr>Offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1217,7 +1217,7 @@
 					<div class="tsd-signature tsd-kind-icon">RPC_<wbr>MAGIC<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">1045105</span><span class="tsd-signature-symbol"> = 1045105</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1239,7 +1239,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/support.ts#L25">support.ts:25</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/support.ts#L25">support.ts:25</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1271,7 +1271,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/support.ts#L39">support.ts:39</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/support.ts#L39">support.ts:39</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1300,7 +1300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/support.ts#L52">support.ts:52</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/support.ts#L52">support.ts:52</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1337,7 +1337,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/compact.ts#L38">compact.ts:38</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/compact.ts#L38">compact.ts:38</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1368,7 +1368,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1390,7 +1390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/environment.ts#L32">environment.ts:32</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/environment.ts#L32">environment.ts:32</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1421,7 +1421,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/compact.ts#L24">compact.ts:24</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/compact.ts#L24">compact.ts:24</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1443,7 +1443,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L1362">runtime.ts:1362</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L1362">runtime.ts:1362</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1508,7 +1508,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/support.ts#L62">support.ts:62</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/support.ts#L62">support.ts:62</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1530,7 +1530,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<wbr>Code<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L246">runtime.ts:246</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L246">runtime.ts:246</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1539,7 +1539,7 @@
 						<div class="tsd-signature tsd-kind-icon">0<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;int&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L247">runtime.ts:247</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L247">runtime.ts:247</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1549,7 +1549,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;uint&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L248">runtime.ts:248</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L248">runtime.ts:248</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1559,7 +1559,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;float&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L249">runtime.ts:249</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L249">runtime.ts:249</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1569,7 +1569,7 @@
 						<div class="tsd-signature tsd-kind-icon">3<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;handle&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L250">runtime.ts:250</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L250">runtime.ts:250</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1580,7 +1580,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Enum<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L175">runtime.ts:175</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L175">runtime.ts:175</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1589,7 +1589,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L176">runtime.ts:176</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L176">runtime.ts:176</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1599,7 +1599,7 @@
 						<div class="tsd-signature tsd-kind-icon">15<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;webgpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L180">runtime.ts:180</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L180">runtime.ts:180</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1609,7 +1609,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cuda&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L177">runtime.ts:177</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L177">runtime.ts:177</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1619,7 +1619,7 @@
 						<div class="tsd-signature tsd-kind-icon">4<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;opencl&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L178">runtime.ts:178</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L178">runtime.ts:178</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1629,7 +1629,7 @@
 						<div class="tsd-signature tsd-kind-icon">8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;metal&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L179">runtime.ts:179</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L179">runtime.ts:179</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1640,7 +1640,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Str<wbr>ToEnum<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L183">runtime.ts:183</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L183">runtime.ts:183</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1649,7 +1649,7 @@
 						<div class="tsd-signature tsd-kind-icon">cl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L186">runtime.ts:186</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L186">runtime.ts:186</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1659,7 +1659,7 @@
 						<div class="tsd-signature tsd-kind-icon">cpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 1</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L184">runtime.ts:184</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L184">runtime.ts:184</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1669,7 +1669,7 @@
 						<div class="tsd-signature tsd-kind-icon">cuda<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 2</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L185">runtime.ts:185</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L185">runtime.ts:185</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1679,7 +1679,7 @@
 						<div class="tsd-signature tsd-kind-icon">metal<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 8</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L189">runtime.ts:189</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L189">runtime.ts:189</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1689,7 +1689,7 @@
 						<div class="tsd-signature tsd-kind-icon">opencl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L187">runtime.ts:187</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L187">runtime.ts:187</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1699,7 +1699,7 @@
 						<div class="tsd-signature tsd-kind-icon">vulkan<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 7</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L188">runtime.ts:188</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L188">runtime.ts:188</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1709,7 +1709,7 @@
 						<div class="tsd-signature tsd-kind-icon">webgpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 15</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/runtime.ts#L190">runtime.ts:190</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/runtime.ts#L190">runtime.ts:190</a></li>
 							</ul>
 						</aside>
 					</section>
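The hunks above only repoint each "Defined in" link from commit beea0d2d6 to 0ae3f5d6c; the documented constants are unchanged. For reference, a minimal TypeScript sketch of the data-type and device mapping tables, reconstructed from the values shown in these typedoc entries (not the runtime's actual exports):

    // Reconstructed from the documented values above; a sketch, not tvmjs exports.
    const DLDataTypeCodeToStr: Record<number, string> = {
      0: "int",
      1: "uint",
      2: "float",
      3: "handle",
    };

    const DeviceStrToEnum: Record<string, number> = {
      cpu: 1, cuda: 2, cl: 4, opencl: 4, vulkan: 7, metal: 8, webgpu: 15,
    };

    const DeviceEnumToStr: Record<number, string> = {
      1: "cpu", 2: "cuda", 4: "opencl", 8: "metal", 15: "webgpu",
    };

    // Hypothetical helper showing how the two tables pair up.
    function roundTrip(name: string): string {
      return DeviceEnumToStr[DeviceStrToEnum[name]];
    }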
diff --git a/docs/reference/api/typedoc/interfaces/disposable.html b/docs/reference/api/typedoc/interfaces/disposable.html
index d803e9f5f..64a3eefba 100644
--- a/docs/reference/api/typedoc/interfaces/disposable.html
+++ b/docs/reference/api/typedoc/interfaces/disposable.html
@@ -113,7 +113,7 @@
 					<div class="tsd-signature tsd-kind-icon">dispose<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/types.ts#L52">types.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/types.ts#L52">types.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/interfaces/functioninfo.html b/docs/reference/api/typedoc/interfaces/functioninfo.html
index 8c2240653..52a190b06 100644
--- a/docs/reference/api/typedoc/interfaces/functioninfo.html
+++ b/docs/reference/api/typedoc/interfaces/functioninfo.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">arg_<wbr>types<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">launch_<wbr>param_<wbr>tags<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">name<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
 						</ul>
 					</aside>
 				</section>
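Per the three members documented above (webgpu.ts:40-42), FunctionInfo is a plain record describing a kernel. A sketch with hypothetical sample values:

    interface FunctionInfo {
      name: string;
      arg_types: Array<string>;
      launch_param_tags: Array<string>;
    }

    // Sample values are illustrative only.
    const info: FunctionInfo = {
      name: "main_kernel0",
      arg_types: ["handle", "handle"],
      launch_param_tags: ["blockIdx.x", "threadIdx.x"],
    };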
diff --git a/docs/reference/api/typedoc/interfaces/libraryprovider.html b/docs/reference/api/typedoc/interfaces/libraryprovider.html
index 0e4ee2c05..d48b57a4c 100644
--- a/docs/reference/api/typedoc/interfaces/libraryprovider.html
+++ b/docs/reference/api/typedoc/interfaces/libraryprovider.html
@@ -112,7 +112,7 @@
 					<div class="tsd-signature tsd-kind-icon">imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/types.ts#L34">types.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/types.ts#L34">types.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -127,7 +127,7 @@
 					<div class="tsd-signature tsd-kind-icon">start<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>inst<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">Instance</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/beea0d2d6/web/src/types.ts#L39">types.ts:39</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/0ae3f5d6c/web/src/types.ts#L39">types.ts:39</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/searchindex.js b/docs/searchindex.js
index 2f521258f..4b12fd2b5 100644
--- a/docs/searchindex.js
+++ b/docs/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["arch/benchmark","arch/convert_layout","arch/debugger","arch/device_target_interactions","arch/frontend/tensorflow","arch/hybrid_script","arch/index","arch/inferbound","arch/introduction_to_module_serialization","arch/microtvm_design","arch/microtvm_project_api","arch/model_library_format","arch/pass_infra","arch/relay_intro","arch/relay_op_strategy","arch/runtime","arch/runtimes/vulkan","arch/security","arch/virtual_machine","contribute/ci","contribute/code_gu [...]
\ No newline at end of file
+Search.setIndex({docnames:["arch/benchmark","arch/convert_layout","arch/debugger","arch/device_target_interactions","arch/frontend/tensorflow","arch/hybrid_script","arch/index","arch/inferbound","arch/introduction_to_module_serialization","arch/microtvm_design","arch/microtvm_project_api","arch/model_library_format","arch/pass_infra","arch/relay_intro","arch/relay_op_strategy","arch/runtime","arch/runtimes/vulkan","arch/security","arch/virtual_machine","contribute/ci","contribute/code_gu [...]
\ No newline at end of file
diff --git a/docs/topic/vta/tutorials/autotvm/sg_execution_times.html b/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
index 1702e2272..293a63af3 100644
--- a/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:21.206</strong> total execution time for <strong>topic_vta_tutorials_autotvm</strong> files:</p>
+<p><strong>00:21.176</strong> total execution time for <strong>topic_vta_tutorials_autotvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 82%" />
@@ -331,7 +331,7 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_relay_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-relay-vta-py"><span class="std std-ref">Auto-tuning a convolutional network on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_vta.py</span></code>)</p></td>
-<td><p>00:21.199</p></td>
+<td><p>00:21.169</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_alu_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-alu-vta-py"><span class="std std-ref">Auto-tuning a ALU fused op on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_alu_vta.py</span></code>)</p></td>
diff --git a/docs/topic/vta/tutorials/frontend/deploy_classification.html b/docs/topic/vta/tutorials/frontend/deploy_classification.html
index 5d3b2cb94..ae6f6da2c 100644
--- a/docs/topic/vta/tutorials/frontend/deploy_classification.html
+++ b/docs/topic/vta/tutorials/frontend/deploy_classification.html
@@ -566,7 +566,7 @@ and dense layer which will both be executed in fp32 on the CPU.</p></li>
   DeprecationWarning,
 /workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the  new recommended usage.
   relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
-resnet18_v1 inference graph built in 22.13s!
+resnet18_v1 inference graph built in 23.68s!
 </pre></div>
 </div>
 </div>
diff --git a/docs/topic/vta/tutorials/frontend/deploy_detection.html b/docs/topic/vta/tutorials/frontend/deploy_detection.html
index 20b2bdb88..881eda89a 100644
--- a/docs/topic/vta/tutorials/frontend/deploy_detection.html
+++ b/docs/topic/vta/tutorials/frontend/deploy_detection.html
@@ -584,7 +584,7 @@ and dense layer which will both be executed in fp32 on the CPU.</p></li>
   &quot;target_host parameter is going to be deprecated. &quot;
 /workspace/python/tvm/relay/build_module.py:411: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
   DeprecationWarning,
-yolov3-tiny inference graph built in 15.55s!
+yolov3-tiny inference graph built in 16.58s!
 </pre></div>
 </div>
 </div>
diff --git a/docs/topic/vta/tutorials/frontend/sg_execution_times.html b/docs/topic/vta/tutorials/frontend/sg_execution_times.html
index b931839f8..e0117e58a 100644
--- a/docs/topic/vta/tutorials/frontend/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/frontend/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-frontend-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>01:30.924</strong> total execution time for <strong>topic_vta_tutorials_frontend</strong> files:</p>
+<p><strong>01:33.232</strong> total execution time for <strong>topic_vta_tutorials_frontend</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -331,11 +331,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_detection.html#sphx-glr-topic-vta-tutorials-frontend-deploy-detection-py"><span class="std std-ref">Deploy Pretrained Vision Detection Model from Darknet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_detection.py</span></code>)</p></td>
-<td><p>00:48.270</p></td>
+<td><p>00:49.348</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_classification.html#sphx-glr-topic-vta-tutorials-frontend-deploy-classification-py"><span class="std std-ref">Deploy Pretrained Vision Model from MxNet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_classification.py</span></code>)</p></td>
-<td><p>00:42.655</p></td>
+<td><p>00:43.884</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/topic/vta/tutorials/optimize/sg_execution_times.html b/docs/topic/vta/tutorials/optimize/sg_execution_times.html
index a6b20b1d4..4d541d94d 100644
--- a/docs/topic/vta/tutorials/optimize/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/optimize/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-optimize-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:03.258</strong> total execution time for <strong>topic_vta_tutorials_optimize</strong> files:</p>
+<p><strong>00:03.451</strong> total execution time for <strong>topic_vta_tutorials_optimize</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -331,11 +331,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="convolution_opt.html#sphx-glr-topic-vta-tutorials-optimize-convolution-opt-py"><span class="std std-ref">2D Convolution Optimization</span></a> (<code class="docutils literal notranslate"><span class="pre">convolution_opt.py</span></code>)</p></td>
-<td><p>00:02.838</p></td>
+<td><p>00:03.057</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="matrix_multiply_opt.html#sphx-glr-topic-vta-tutorials-optimize-matrix-multiply-opt-py"><span class="std std-ref">Matrix Multiply Blocking</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply_opt.py</span></code>)</p></td>
-<td><p>00:00.420</p></td>
+<td><p>00:00.395</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/topic/vta/tutorials/sg_execution_times.html b/docs/topic/vta/tutorials/sg_execution_times.html
index 2c09a0649..226ee19b5 100644
--- a/docs/topic/vta/tutorials/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:00.759</strong> total execution time for <strong>topic_vta_tutorials</strong> files:</p>
+<p><strong>00:00.694</strong> total execution time for <strong>topic_vta_tutorials</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 81%" />
@@ -331,11 +331,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="matrix_multiply.html#sphx-glr-topic-vta-tutorials-matrix-multiply-py"><span class="std std-ref">Simple Matrix Multiply</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply.py</span></code>)</p></td>
-<td><p>00:00.405</p></td>
+<td><p>00:00.373</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="vta_get_started.html#sphx-glr-topic-vta-tutorials-vta-get-started-py"><span class="std std-ref">Get Started with VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">vta_get_started.py</span></code>)</p></td>
-<td><p>00:00.354</p></td>
+<td><p>00:00.320</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/tutorial/auto_scheduler_matmul_x86.html b/docs/tutorial/auto_scheduler_matmul_x86.html
index cb9174833..bfd25856b 100644
--- a/docs/tutorial/auto_scheduler_matmul_x86.html
+++ b/docs/tutorial/auto_scheduler_matmul_x86.html
@@ -474,9 +474,6 @@ trials, we can load the best schedule from the log file and apply it.</p>
 <a href="../reference/api/python/te.html#tvm.te.Schedule" title="tvm.te.Schedule" class="sphx-glr-backref-module-tvm-te sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">sch</span></a><span class="p">,</span> <a href="../reference/api/python/ir.html#tvm.ir.Array" title="tvm.ir.Array" class="sphx-glr-backref-module-tvm-ir sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">args</span></a> <span class="o">=</span> <a href="../reference/api/pyth [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>*E
-</pre></div>
-</div>
 </div>
 <div class="section" id="inspecting-the-optimized-schedule">
 <h2>Inspecting the Optimized Schedule<a class="headerlink" href="#inspecting-the-optimized-schedule" title="Permalink to this headline">¶</a></h2>
@@ -564,7 +561,7 @@ operator fusion.</p>
 <span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 93.756 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 93.448 ms
 </pre></div>
 </div>
 </div>
@@ -638,7 +635,6 @@ automatically optimize a matrix multiplication, without the need to specify a
 search template.  It ends a series of examples that starts from the Tensor
 Expression (TE) language that demonstrates how TVM can optimize computational
 operations.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  3.740 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-auto-scheduler-matmul-x86-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">auto_scheduler_matmul_x86.py</span></code></a></p>
diff --git a/docs/tutorial/autotvm_matmul_x86.html b/docs/tutorial/autotvm_matmul_x86.html
index cb6b31c43..608238734 100644
--- a/docs/tutorial/autotvm_matmul_x86.html
+++ b/docs/tutorial/autotvm_matmul_x86.html
@@ -663,16 +663,16 @@ reduce variance, we take 5 measurements and average them.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>waiting for device...
 device available
 Get devices for measurement successfully!
-No: 1   GFLOPS: 10.52/10.52     result: MeasureResult(costs=(0.0255233368,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5395951271057129, timestamp=1656634327.0577953)       [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 256])],None,80
-No: 2   GFLOPS: 2.91/10.52      result: MeasureResult(costs=(0.0921453436,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6114046573638916, timestamp=1656634328.6957347)       [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 8])],None,32
-No: 3   GFLOPS: 11.81/11.81     result: MeasureResult(costs=(0.022731147,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5623428821563721, timestamp=1656634329.7482014)        [(&#39;tile_y&#39;, [-1, 64]), (&#39;tile_x&#39;, [-1, 32])],None,56
-No: 4   GFLOPS: 1.84/11.81      result: MeasureResult(costs=(0.1455941442,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.4478328227996826, timestamp=1656634332.24224) [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 4])],None,20
-No: 5   GFLOPS: 3.65/11.81      result: MeasureResult(costs=(0.073621501,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.307786464691162, timestamp=1656634333.6774862) [(&#39;tile_y&#39;, [-1, 256]), (&#39;tile_x&#39;, [-1, 16])],None,48
-No: 6   GFLOPS: 1.75/11.81      result: MeasureResult(costs=(0.1537386102,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.578004837036133, timestamp=1656634336.824599) [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 4])],None,29
-No: 7   GFLOPS: 0.87/11.81      result: MeasureResult(costs=(0.3075676484,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.0429277420043945, timestamp=1656634342.430558)        [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 2])],None,19
-No: 8   GFLOPS: 10.71/11.81     result: MeasureResult(costs=(0.025061531400000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.542504072189331, timestamp=1656634342.995255) [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 64])],None,62
-No: 9   GFLOPS: 1.88/11.81      result: MeasureResult(costs=(0.14242153419999998,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.3780105113983154, timestamp=1656634345.491731) [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 2])],None,11
-No: 10  GFLOPS: 2.77/11.81      result: MeasureResult(costs=(0.09689358379999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6534321308135986, timestamp=1656634347.2038527)        [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 4])],None,22
+No: 1   GFLOPS: 10.54/10.54     result: MeasureResult(costs=(0.025478070800000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5391578674316406, timestamp=1656694236.2053952)       [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 256])],None,80
+No: 2   GFLOPS: 2.92/10.54      result: MeasureResult(costs=(0.09179072079999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6077759265899658, timestamp=1656694237.8423743)        [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 8])],None,32
+No: 3   GFLOPS: 11.83/11.83     result: MeasureResult(costs=(0.0226969984,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5629618167877197, timestamp=1656694238.9021978)       [(&#39;tile_y&#39;, [-1, 64]), (&#39;tile_x&#39;, [-1, 32])],None,56
+No: 4   GFLOPS: 1.85/11.83      result: MeasureResult(costs=(0.1448239584,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.444626569747925, timestamp=1656694241.902186) [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 4])],None,20
+No: 5   GFLOPS: 3.68/11.83      result: MeasureResult(costs=(0.07296285659999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.3114027976989746, timestamp=1656694243.3455317)        [(&#39;tile_y&#39;, [-1, 256]), (&#39;tile_x&#39;, [-1, 16])],None,48
+No: 6   GFLOPS: 1.84/11.83      result: MeasureResult(costs=(0.1456212594,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.498997926712036, timestamp=1656694245.885647) [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 4])],None,29
+No: 7   GFLOPS: 0.85/11.83      result: MeasureResult(costs=(0.3165628946,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.185196876525879, timestamp=1656694251.6419764)        [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 2])],None,19
+No: 8   GFLOPS: 10.60/11.83     result: MeasureResult(costs=(0.025324113199999998,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5541675090789795, timestamp=1656694252.2091339)       [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 64])],None,62
+No: 9   GFLOPS: 1.91/11.83      result: MeasureResult(costs=(0.14073400200000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.350316047668457, timestamp=1656694254.6798334) [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 2])],None,11
+No: 10  GFLOPS: 2.79/11.83      result: MeasureResult(costs=(0.0961036628,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6410527229309082, timestamp=1656694256.378456)        [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 4])],None,22
 </pre></div>
 </div>
 <p>With tuning completed, we can choose the configuration from the log file that
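The GFLOPS column can be sanity-checked from the measured cost. Assuming the tutorial's 512x512x512 matmul (the problem size is not shown in this diff), each run performs 2 * 512^3 floating-point operations; in TypeScript:

    // Trial No. 1 above: cost ≈ 0.0254781 s, reported as 10.54 GFLOPS.
    const flops = 2 * 512 ** 3;            // 268,435,456 FLOPs per matmul
    const costSec = 0.025478070800000002;  // from the MeasureResult above
    console.log((flops / costSec / 1e9).toFixed(2)); // prints "10.54"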
diff --git a/docs/tutorial/autotvm_relay_x86.html b/docs/tutorial/autotvm_relay_x86.html
index 9fcae55fe..aeb963529 100644
--- a/docs/tutorial/autotvm_relay_x86.html
+++ b/docs/tutorial/autotvm_relay_x86.html
@@ -545,7 +545,7 @@ standard deviation.</p>
 <span class="nb">print</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">unoptimized</span></a><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;mean&#39;: 496.66298516000097, &#39;median&#39;: 496.44014619998416, &#39;std&#39;: 0.7022675180231087}
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;mean&#39;: 497.4941768000008, &#39;median&#39;: 497.148796700003, &#39;std&#39;: 1.2604263698723472}
 </pre></div>
 </div>
 </div>
@@ -700,179 +700,179 @@ depending on the specifics of the model and the target platform.</p>
   &quot;target_host parameter is going to be deprecated. &quot;
 
 [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  1/25]  Current/Best:   17.54/  17.54 GFLOPS | Progress: (4/20) | 6.76 s
-[Task  1/25]  Current/Best:    6.15/  17.54 GFLOPS | Progress: (8/20) | 9.23 s
-[Task  1/25]  Current/Best:   11.51/  22.87 GFLOPS | Progress: (12/20) | 11.69 s
-[Task  1/25]  Current/Best:   16.75/  22.87 GFLOPS | Progress: (16/20) | 13.37 s
-[Task  1/25]  Current/Best:   11.51/  23.47 GFLOPS | Progress: (20/20) | 15.16 s Done.
+[Task  1/25]  Current/Best:   17.44/  17.44 GFLOPS | Progress: (4/20) | 6.38 s
+[Task  1/25]  Current/Best:    6.16/  17.44 GFLOPS | Progress: (8/20) | 9.30 s
+[Task  1/25]  Current/Best:   11.48/  22.69 GFLOPS | Progress: (12/20) | 11.76 s
+[Task  1/25]  Current/Best:   16.80/  22.83 GFLOPS | Progress: (16/20) | 13.45 s
+[Task  1/25]  Current/Best:   11.51/  23.85 GFLOPS | Progress: (20/20) | 15.19 s Done.
 
 [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  2/25]  Current/Best:   12.26/  12.88 GFLOPS | Progress: (4/20) | 3.94 s
-[Task  2/25]  Current/Best:   14.00/  17.60 GFLOPS | Progress: (8/20) | 5.27 s
-[Task  2/25]  Current/Best:   21.00/  21.00 GFLOPS | Progress: (12/20) | 6.61 s
-[Task  2/25]  Current/Best:   12.28/  21.00 GFLOPS | Progress: (16/20) | 7.86 s
-[Task  2/25]  Current/Best:   19.65/  21.00 GFLOPS | Progress: (20/20) | 9.50 s Done.
+[Task  2/25]  Current/Best:   12.26/  13.12 GFLOPS | Progress: (4/20) | 3.81 s
+[Task  2/25]  Current/Best:   14.20/  18.42 GFLOPS | Progress: (8/20) | 5.14 s
+[Task  2/25]  Current/Best:   20.89/  20.89 GFLOPS | Progress: (12/20) | 6.49 s
+[Task  2/25]  Current/Best:   12.39/  20.89 GFLOPS | Progress: (16/20) | 7.75 s
+[Task  2/25]  Current/Best:   19.69/  20.89 GFLOPS | Progress: (20/20) | 9.34 s Done.
 
 [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  3/25]  Current/Best:    1.63/  10.52 GFLOPS | Progress: (4/20) | 5.85 s
-[Task  3/25]  Current/Best:   15.58/  16.87 GFLOPS | Progress: (8/20) | 7.76 s
-[Task  3/25]  Current/Best:   14.90/  16.87 GFLOPS | Progress: (12/20) | 9.47 s
-[Task  3/25]  Current/Best:    7.21/  23.80 GFLOPS | Progress: (16/20) | 11.38 s
-[Task  3/25]  Current/Best:   12.66/  23.80 GFLOPS | Progress: (20/20) | 15.97 s Done.
+[Task  3/25]  Current/Best:    1.63/  10.50 GFLOPS | Progress: (4/20) | 5.86 s
+[Task  3/25]  Current/Best:   15.54/  16.71 GFLOPS | Progress: (8/20) | 7.81 s
+[Task  3/25]  Current/Best:   14.81/  16.71 GFLOPS | Progress: (12/20) | 9.56 s
+[Task  3/25]  Current/Best:    7.13/  23.78 GFLOPS | Progress: (16/20) | 11.49 s
+[Task  3/25]  Current/Best:   12.61/  23.78 GFLOPS | Progress: (20/20) | 16.01 s Done.
 
 [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  4/25]  Current/Best:    9.55/  20.42 GFLOPS | Progress: (4/20) | 2.38 s
-[Task  4/25]  Current/Best:    6.87/  20.42 GFLOPS | Progress: (8/20) | 7.19 s
-[Task  4/25]  Current/Best:   22.26/  22.26 GFLOPS | Progress: (12/20) | 12.23 s
-[Task  4/25]  Current/Best:   17.16/  22.26 GFLOPS | Progress: (16/20) | 14.64 s
-[Task  4/25]  Current/Best:   13.17/  22.26 GFLOPS | Progress: (20/20) | 16.62 s Done.
+[Task  4/25]  Current/Best:    9.57/  20.37 GFLOPS | Progress: (4/20) | 2.43 s
+[Task  4/25]  Current/Best:    6.29/  20.37 GFLOPS | Progress: (8/20) | 6.85 s
+[Task  4/25]  Current/Best:   22.41/  22.41 GFLOPS | Progress: (12/20) | 11.47 s
+[Task  4/25]  Current/Best:   16.62/  22.41 GFLOPS | Progress: (16/20) | 13.74 s
+[Task  4/25]  Current/Best:   13.12/  22.41 GFLOPS | Progress: (20/20) | 15.75 s Done.
 
 [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  5/25]  Current/Best:    9.59/  10.32 GFLOPS | Progress: (4/20) | 2.59 s
-[Task  5/25]  Current/Best:   11.63/  11.96 GFLOPS | Progress: (8/20) | 4.67 s
-[Task  5/25]  Current/Best:   11.51/  18.04 GFLOPS | Progress: (12/20) | 7.90 s
-[Task  5/25]  Current/Best:   11.61/  22.46 GFLOPS | Progress: (16/20) | 9.36 s
-[Task  5/25]  Current/Best:   11.95/  22.46 GFLOPS | Progress: (20/20) | 11.25 s Done.
+[Task  5/25]  Current/Best:    9.26/  10.13 GFLOPS | Progress: (4/20) | 2.65 s
+[Task  5/25]  Current/Best:   11.61/  12.69 GFLOPS | Progress: (8/20) | 4.77 s
+[Task  5/25]  Current/Best:   11.22/  18.04 GFLOPS | Progress: (12/20) | 7.87 s
+[Task  5/25]  Current/Best:   11.33/  22.53 GFLOPS | Progress: (16/20) | 9.29 s
+[Task  5/25]  Current/Best:   11.99/  22.53 GFLOPS | Progress: (20/20) | 11.23 s Done.
 
 [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  6/25]  Current/Best:   12.17/  20.66 GFLOPS | Progress: (4/20) | 4.11 s
-[Task  6/25]  Current/Best:   18.94/  20.66 GFLOPS | Progress: (8/20) | 5.87 s
-[Task  6/25]  Current/Best:   13.28/  20.66 GFLOPS | Progress: (12/20) | 7.83 s
-[Task  6/25]  Current/Best:   20.00/  20.66 GFLOPS | Progress: (16/20) | 10.06 s
-[Task  6/25]  Current/Best:    3.73/  20.66 GFLOPS | Progress: (20/20) | 12.58 s Done.
+[Task  6/25]  Current/Best:   12.17/  20.69 GFLOPS | Progress: (4/20) | 3.95 s
+[Task  6/25]  Current/Best:   18.97/  20.69 GFLOPS | Progress: (8/20) | 5.72 s
+[Task  6/25]  Current/Best:   13.29/  20.69 GFLOPS | Progress: (12/20) | 7.66 s
+[Task  6/25]  Current/Best:   19.97/  20.69 GFLOPS | Progress: (16/20) | 9.95 s
+[Task  6/25]  Current/Best:    3.71/  20.69 GFLOPS | Progress: (20/20) | 12.49 s Done.
 
 [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  7/25]  Current/Best:   11.26/  12.72 GFLOPS | Progress: (4/20) | 3.54 s
-[Task  7/25]  Current/Best:   20.20/  21.01 GFLOPS | Progress: (8/20) | 5.07 s
-[Task  7/25]  Current/Best:   14.97/  21.01 GFLOPS | Progress: (12/20) | 6.99 s
-[Task  7/25]  Current/Best:   12.26/  21.01 GFLOPS | Progress: (16/20) | 9.03 s
-[Task  7/25]  Current/Best:    6.27/  21.70 GFLOPS | Progress: (20/20) | 11.49 s Done.
+[Task  7/25]  Current/Best:   11.14/  12.81 GFLOPS | Progress: (4/20) | 3.67 s
+[Task  7/25]  Current/Best:   20.23/  20.96 GFLOPS | Progress: (8/20) | 5.18 s
+[Task  7/25]  Current/Best:   15.85/  20.96 GFLOPS | Progress: (12/20) | 7.08 s
+[Task  7/25]  Current/Best:   12.22/  20.96 GFLOPS | Progress: (16/20) | 9.13 s
+[Task  7/25]  Current/Best:    6.27/  21.69 GFLOPS | Progress: (20/20) | 11.61 s Done.
 
 [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  8/25]  Current/Best:    9.82/  14.35 GFLOPS | Progress: (4/20) | 2.93 s
-[Task  8/25]  Current/Best:    9.43/  14.35 GFLOPS | Progress: (8/20) | 8.11 s
-[Task  8/25]  Current/Best:   12.17/  14.35 GFLOPS | Progress: (12/20) | 14.70 s
-[Task  8/25]  Current/Best:   18.78/  18.78 GFLOPS | Progress: (16/20) | 16.83 s
-[Task  8/25]  Current/Best:   19.89/  19.89 GFLOPS | Progress: (20/20) | 24.03 s Done.
+[Task  8/25]  Current/Best:    9.96/  14.04 GFLOPS | Progress: (4/20) | 2.91 s
+[Task  8/25]  Current/Best:    9.17/  14.04 GFLOPS | Progress: (8/20) | 7.75 s
+[Task  8/25]  Current/Best:   12.83/  14.04 GFLOPS | Progress: (12/20) | 14.00 s
+[Task  8/25]  Current/Best:   18.88/  18.88 GFLOPS | Progress: (16/20) | 16.06 s
+[Task  8/25]  Current/Best:   19.66/  19.66 GFLOPS | Progress: (20/20) | 22.73 s Done.
 
 [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  9/25]  Current/Best:   14.38/  15.87 GFLOPS | Progress: (4/20) | 11.97 s
-[Task  9/25]  Current/Best:   23.39/  23.39 GFLOPS | Progress: (8/20) | 13.81 s
-[Task  9/25]  Current/Best:    8.25/  23.39 GFLOPS | Progress: (12/20) | 16.39 s
-[Task  9/25]  Current/Best:   17.73/  23.39 GFLOPS | Progress: (16/20) | 19.30 s
-[Task  9/25]  Current/Best:    9.05/  23.39 GFLOPS | Progress: (20/20) | 27.97 s
+[Task  9/25]  Current/Best:   14.32/  15.59 GFLOPS | Progress: (4/20) | 11.97 s
+[Task  9/25]  Current/Best:   23.32/  23.32 GFLOPS | Progress: (8/20) | 13.74 s
+[Task  9/25]  Current/Best:    8.26/  23.32 GFLOPS | Progress: (12/20) | 16.08 s
+[Task  9/25]  Current/Best:   17.99/  23.32 GFLOPS | Progress: (16/20) | 18.79 s
+[Task  9/25]  Current/Best:    8.87/  23.32 GFLOPS | Progress: (20/20) | 26.56 s
 [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 10/25]  Current/Best:   18.25/  18.25 GFLOPS | Progress: (4/20) | 2.58 s
-[Task 10/25]  Current/Best:   15.46/  18.25 GFLOPS | Progress: (8/20) | 4.22 s
-[Task 10/25]  Current/Best:   12.43/  18.95 GFLOPS | Progress: (12/20) | 5.77 s
-[Task 10/25]  Current/Best:   19.22/  20.33 GFLOPS | Progress: (16/20) | 6.88 s
-[Task 10/25]  Current/Best:    8.97/  20.33 GFLOPS | Progress: (20/20) | 8.43 s Done.
+[Task 10/25]  Current/Best:   18.20/  18.20 GFLOPS | Progress: (4/20) | 2.57 s
+[Task 10/25]  Current/Best:   15.33/  18.20 GFLOPS | Progress: (8/20) | 4.16 s
+[Task 10/25]  Current/Best:   12.50/  18.78 GFLOPS | Progress: (12/20) | 5.71 s
+[Task 10/25]  Current/Best:   19.02/  20.37 GFLOPS | Progress: (16/20) | 6.82 s
+[Task 10/25]  Current/Best:    8.84/  20.37 GFLOPS | Progress: (20/20) | 8.39 s Done.
 
 [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 11/25]  Current/Best:   12.24/  18.03 GFLOPS | Progress: (4/20) | 3.32 s
-[Task 11/25]  Current/Best:   16.90/  18.03 GFLOPS | Progress: (8/20) | 6.15 s
-[Task 11/25]  Current/Best:   18.10/  18.10 GFLOPS | Progress: (12/20) | 8.22 s
-[Task 11/25]  Current/Best:   13.25/  21.09 GFLOPS | Progress: (16/20) | 11.11 s
-[Task 11/25]  Current/Best:   19.44/  21.62 GFLOPS | Progress: (20/20) | 13.22 s Done.
+[Task 11/25]  Current/Best:   12.26/  18.09 GFLOPS | Progress: (4/20) | 3.31 s
+[Task 11/25]  Current/Best:   16.84/  18.09 GFLOPS | Progress: (8/20) | 6.07 s
+[Task 11/25]  Current/Best:   17.87/  18.09 GFLOPS | Progress: (12/20) | 8.09 s
+[Task 11/25]  Current/Best:   13.40/  21.17 GFLOPS | Progress: (16/20) | 10.81 s
+[Task 11/25]  Current/Best:   19.43/  21.55 GFLOPS | Progress: (20/20) | 12.82 s Done.
 
 [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 12/25]  Current/Best:    7.79/  18.12 GFLOPS | Progress: (4/20) | 5.83 s
-[Task 12/25]  Current/Best:    5.28/  18.12 GFLOPS | Progress: (8/20) | 9.81 s
-[Task 12/25]  Current/Best:   18.91/  18.91 GFLOPS | Progress: (12/20) | 11.83 s
-[Task 12/25]  Current/Best:   15.39/  18.91 GFLOPS | Progress: (16/20) | 14.75 s
-[Task 12/25]  Current/Best:   15.19/  18.91 GFLOPS | Progress: (20/20) | 16.67 s Done.
+[Task 12/25]  Current/Best:    7.80/  17.95 GFLOPS | Progress: (4/20) | 5.43 s
+[Task 12/25]  Current/Best:    5.19/  17.95 GFLOPS | Progress: (8/20) | 9.16 s
+[Task 12/25]  Current/Best:   18.83/  18.83 GFLOPS | Progress: (12/20) | 11.19 s
+[Task 12/25]  Current/Best:   15.29/  18.83 GFLOPS | Progress: (16/20) | 14.03 s
+[Task 12/25]  Current/Best:   15.11/  18.83 GFLOPS | Progress: (20/20) | 15.96 s Done.
 
 [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 13/25]  Current/Best:    8.69/  17.27 GFLOPS | Progress: (4/20) | 3.76 s
-[Task 13/25]  Current/Best:   16.02/  20.83 GFLOPS | Progress: (8/20) | 6.40 s
-[Task 13/25]  Current/Best:   19.62/  21.72 GFLOPS | Progress: (12/20) | 9.41 s
-[Task 13/25]  Current/Best:   12.28/  21.72 GFLOPS | Progress: (16/20) | 12.89 s
-[Task 13/25]  Current/Best:   18.76/  21.72 GFLOPS | Progress: (20/20) | 15.23 s Done.
+[Task 13/25]  Current/Best:    8.78/  17.26 GFLOPS | Progress: (4/20) | 3.73 s
+[Task 13/25]  Current/Best:   16.02/  20.76 GFLOPS | Progress: (8/20) | 6.20 s
+[Task 13/25]  Current/Best:   19.50/  21.23 GFLOPS | Progress: (12/20) | 9.09 s
+[Task 13/25]  Current/Best:   12.25/  21.23 GFLOPS | Progress: (16/20) | 12.54 s
+[Task 13/25]  Current/Best:   18.70/  21.23 GFLOPS | Progress: (20/20) | 14.77 s Done.
 
 [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 14/25]  Current/Best:   13.54/  13.54 GFLOPS | Progress: (4/20) | 3.40 s
-[Task 14/25]  Current/Best:    6.09/  13.54 GFLOPS | Progress: (8/20) | 5.57 s
-[Task 14/25]  Current/Best:   21.29/  21.29 GFLOPS | Progress: (12/20) | 8.26 s
-[Task 14/25]  Current/Best:   17.03/  21.29 GFLOPS | Progress: (16/20) | 9.91 s Done.
+[Task 14/25]  Current/Best:   12.27/  13.29 GFLOPS | Progress: (4/20) | 3.41 s
+[Task 14/25]  Current/Best:    6.11/  13.30 GFLOPS | Progress: (8/20) | 5.61 s
+[Task 14/25]  Current/Best:   20.84/  20.84 GFLOPS | Progress: (12/20) | 8.14 s
+[Task 14/25]  Current/Best:   16.53/  20.84 GFLOPS | Progress: (16/20) | 9.78 s Done.
 
-[Task 14/25]  Current/Best:   17.29/  21.29 GFLOPS | Progress: (20/20) | 11.67 s
+[Task 14/25]  Current/Best:   17.33/  20.84 GFLOPS | Progress: (20/20) | 11.50 s
 [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 15/25]  Current/Best:   16.09/  17.63 GFLOPS | Progress: (4/20) | 2.73 s
-[Task 15/25]  Current/Best:   14.47/  18.09 GFLOPS | Progress: (8/20) | 4.02 s
-[Task 15/25]  Current/Best:   10.39/  22.31 GFLOPS | Progress: (12/20) | 6.41 s
-[Task 15/25]  Current/Best:   20.40/  22.31 GFLOPS | Progress: (16/20) | 10.05 s
-[Task 15/25]  Current/Best:    9.65/  22.31 GFLOPS | Progress: (20/20) | 11.06 s
+[Task 15/25]  Current/Best:   16.14/  17.67 GFLOPS | Progress: (4/20) | 2.72 s
+[Task 15/25]  Current/Best:   13.31/  18.09 GFLOPS | Progress: (8/20) | 4.02 s
+[Task 15/25]  Current/Best:   10.42/  22.32 GFLOPS | Progress: (12/20) | 6.09 s
+[Task 15/25]  Current/Best:   20.40/  22.32 GFLOPS | Progress: (16/20) | 9.44 s
+[Task 15/25]  Current/Best:    9.71/  22.32 GFLOPS | Progress: (20/20) | 10.46 s
 [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 16/25]  Current/Best:   20.76/  20.76 GFLOPS | Progress: (4/20) | 2.94 s
-[Task 16/25]  Current/Best:    3.04/  20.76 GFLOPS | Progress: (8/20) | 4.54 s
-[Task 16/25]  Current/Best:   19.51/  20.76 GFLOPS | Progress: (12/20) | 5.75 s
-[Task 16/25]  Current/Best:   17.42/  20.76 GFLOPS | Progress: (16/20) | 7.10 s
-[Task 16/25]  Current/Best:    9.98/  21.27 GFLOPS | Progress: (20/20) | 9.25 s Done.
+[Task 16/25]  Current/Best:   20.68/  20.68 GFLOPS | Progress: (4/20) | 2.93 s
+[Task 16/25]  Current/Best:    3.04/  20.68 GFLOPS | Progress: (8/20) | 4.55 s
+[Task 16/25]  Current/Best:   19.83/  20.68 GFLOPS | Progress: (12/20) | 5.76 s
+[Task 16/25]  Current/Best:   18.23/  20.68 GFLOPS | Progress: (16/20) | 7.13 s
+[Task 16/25]  Current/Best:    9.95/  22.56 GFLOPS | Progress: (20/20) | 9.16 s Done.
 
 [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 17/25]  Current/Best:   12.45/  18.68 GFLOPS | Progress: (4/20) | 4.79 s
-[Task 17/25]  Current/Best:   14.37/  23.33 GFLOPS | Progress: (8/20) | 7.68 s
-[Task 17/25]  Current/Best:   16.74/  23.33 GFLOPS | Progress: (12/20) | 9.73 s
-[Task 17/25]  Current/Best:   16.39/  23.33 GFLOPS | Progress: (16/20) | 11.94 s
-[Task 17/25]  Current/Best:   10.03/  23.33 GFLOPS | Progress: (20/20) | 14.11 s Done.
+[Task 17/25]  Current/Best:   12.86/  18.66 GFLOPS | Progress: (4/20) | 4.73 s
+[Task 17/25]  Current/Best:   14.41/  23.28 GFLOPS | Progress: (8/20) | 7.52 s
+[Task 17/25]  Current/Best:   16.88/  23.28 GFLOPS | Progress: (12/20) | 9.57 s
+[Task 17/25]  Current/Best:   16.51/  23.28 GFLOPS | Progress: (16/20) | 11.71 s
+[Task 17/25]  Current/Best:   10.05/  23.28 GFLOPS | Progress: (20/20) | 13.84 s Done.
 
 [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 18/25]  Current/Best:   11.21/  17.92 GFLOPS | Progress: (4/20) | 3.81 s
-[Task 18/25]  Current/Best:   10.57/  19.62 GFLOPS | Progress: (8/20) | 7.51 s
-[Task 18/25]  Current/Best:   19.50/  19.62 GFLOPS | Progress: (12/20) | 9.43 s
-[Task 18/25]  Current/Best:   10.07/  19.62 GFLOPS | Progress: (16/20) | 13.35 s
-[Task 18/25]  Current/Best:   20.57/  20.57 GFLOPS | Progress: (20/20) | 14.84 s Done.
+[Task 18/25]  Current/Best:   11.23/  17.92 GFLOPS | Progress: (4/20) | 3.74 s
+[Task 18/25]  Current/Best:   10.54/  18.68 GFLOPS | Progress: (8/20) | 7.15 s
+[Task 18/25]  Current/Best:   19.47/  19.47 GFLOPS | Progress: (12/20) | 9.08 s
+[Task 18/25]  Current/Best:   10.07/  19.47 GFLOPS | Progress: (16/20) | 12.71 s
+[Task 18/25]  Current/Best:   20.56/  20.56 GFLOPS | Progress: (20/20) | 14.21 s Done.
 
 [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 19/25]  Current/Best:    7.09/  20.38 GFLOPS | Progress: (4/20) | 6.09 s
-[Task 19/25]  Current/Best:    2.61/  20.38 GFLOPS | Progress: (8/20) | 9.38 s
-[Task 19/25]  Current/Best:   20.22/  21.80 GFLOPS | Progress: (12/20) | 12.38 s
-[Task 19/25]  Current/Best:   14.49/  21.80 GFLOPS | Progress: (16/20) | 15.42 s
-[Task 19/25]  Current/Best:    2.70/  23.65 GFLOPS | Progress: (20/20) | 18.23 s Done.
+[Task 19/25]  Current/Best:    7.21/  20.33 GFLOPS | Progress: (4/20) | 6.02 s
+[Task 19/25]  Current/Best:    2.61/  20.33 GFLOPS | Progress: (8/20) | 9.27 s
+[Task 19/25]  Current/Best:   19.61/  21.39 GFLOPS | Progress: (12/20) | 12.11 s
+[Task 19/25]  Current/Best:   15.55/  21.39 GFLOPS | Progress: (16/20) | 15.01 s
+[Task 19/25]  Current/Best:    2.70/  23.59 GFLOPS | Progress: (20/20) | 17.78 s Done.
 
 [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 20/25]  Current/Best:    8.94/  15.17 GFLOPS | Progress: (4/20) | 3.32 s Done.
+[Task 20/25]  Current/Best:    9.39/  15.17 GFLOPS | Progress: (4/20) | 3.30 s Done.
  Done.
 
-[Task 20/25]  Current/Best:    9.75/  15.17 GFLOPS | Progress: (8/20) | 6.73 s
-[Task 20/25]  Current/Best:    2.32/  15.17 GFLOPS | Progress: (12/20) | 10.65 s
-[Task 20/25]  Current/Best:   12.56/  15.17 GFLOPS | Progress: (16/20) | 14.58 s
-[Task 20/25]  Current/Best:   12.37/  21.97 GFLOPS | Progress: (20/20) | 16.68 s
+[Task 20/25]  Current/Best:    9.80/  15.17 GFLOPS | Progress: (8/20) | 6.72 s
+[Task 20/25]  Current/Best:    2.32/  16.57 GFLOPS | Progress: (12/20) | 10.70 s
+[Task 20/25]  Current/Best:   12.34/  16.57 GFLOPS | Progress: (16/20) | 14.49 s
+[Task 20/25]  Current/Best:   13.44/  22.10 GFLOPS | Progress: (20/20) | 16.60 s
 [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 21/25]  Current/Best:    6.41/  17.74 GFLOPS | Progress: (4/20) | 3.28 s
-[Task 21/25]  Current/Best:   14.61/  17.74 GFLOPS | Progress: (8/20) | 4.86 s
-[Task 21/25]  Current/Best:    1.61/  17.74 GFLOPS | Progress: (12/20) | 6.99 s
-[Task 21/25]  Current/Best:   17.89/  17.89 GFLOPS | Progress: (16/20) | 10.51 s
-[Task 21/25]  Current/Best:    4.47/  17.89 GFLOPS | Progress: (20/20) | 17.84 s
+[Task 21/25]  Current/Best:    6.39/  17.62 GFLOPS | Progress: (4/20) | 3.26 s
+[Task 21/25]  Current/Best:   14.66/  17.62 GFLOPS | Progress: (8/20) | 4.84 s
+[Task 21/25]  Current/Best:    1.61/  17.62 GFLOPS | Progress: (12/20) | 6.98 s
+[Task 21/25]  Current/Best:   18.17/  18.17 GFLOPS | Progress: (16/20) | 10.42 s
+[Task 21/25]  Current/Best:    4.47/  18.17 GFLOPS | Progress: (20/20) | 17.56 s
 [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
 [Task 22/25]  Current/Best:    2.70/  16.97 GFLOPS | Progress: (4/20) | 2.66 s
-[Task 22/25]  Current/Best:    8.64/  21.96 GFLOPS | Progress: (8/20) | 4.65 s
-[Task 22/25]  Current/Best:   20.05/  21.96 GFLOPS | Progress: (12/20) | 7.05 s
-[Task 22/25]  Current/Best:   15.36/  21.96 GFLOPS | Progress: (16/20) | 9.21 s
-[Task 22/25]  Current/Best:   14.24/  21.96 GFLOPS | Progress: (20/20) | 10.94 s Done.
+[Task 22/25]  Current/Best:    8.66/  21.89 GFLOPS | Progress: (8/20) | 4.58 s
+[Task 22/25]  Current/Best:   20.03/  21.89 GFLOPS | Progress: (12/20) | 6.90 s
+[Task 22/25]  Current/Best:   15.55/  21.89 GFLOPS | Progress: (16/20) | 8.96 s
+[Task 22/25]  Current/Best:   14.12/  21.89 GFLOPS | Progress: (20/20) | 10.63 s Done.
 
 [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 23/25]  Current/Best:   17.59/  20.57 GFLOPS | Progress: (4/20) | 3.22 s
-[Task 23/25]  Current/Best:   14.69/  20.57 GFLOPS | Progress: (8/20) | 6.50 s
-[Task 23/25]  Current/Best:   20.95/  21.77 GFLOPS | Progress: (12/20) | 8.34 s
-[Task 23/25]  Current/Best:    6.36/  21.77 GFLOPS | Progress: (16/20) | 15.50 s
-[Task 23/25]  Current/Best:    7.89/  21.77 GFLOPS | Progress: (20/20) | 19.73 s Done.
+[Task 23/25]  Current/Best:   17.69/  20.83 GFLOPS | Progress: (4/20) | 3.23 s
+[Task 23/25]  Current/Best:   14.09/  20.83 GFLOPS | Progress: (8/20) | 6.53 s
+[Task 23/25]  Current/Best:   20.97/  21.68 GFLOPS | Progress: (12/20) | 8.35 s
+[Task 23/25]  Current/Best:    6.44/  21.68 GFLOPS | Progress: (16/20) | 15.41 s
+[Task 23/25]  Current/Best:    7.91/  21.68 GFLOPS | Progress: (20/20) | 19.58 s Done.
 
 [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 24/25]  Current/Best:    8.37/   8.37 GFLOPS | Progress: (4/20) | 11.78 s
-[Task 24/25]  Current/Best:    2.08/   8.37 GFLOPS | Progress: (8/20) | 22.79 s
-[Task 24/25]  Current/Best:    4.33/   8.37 GFLOPS | Progress: (12/20) | 34.33 s Done.
+[Task 24/25]  Current/Best:    8.25/   8.25 GFLOPS | Progress: (4/20) | 11.77 s
+[Task 24/25]  Current/Best:    3.48/   8.25 GFLOPS | Progress: (8/20) | 23.02 s
+[Task 24/25]  Current/Best:    4.22/   8.25 GFLOPS | Progress: (12/20) | 33.74 s Done.
  Done.
 
-[Task 24/25]  Current/Best:    6.02/   8.74 GFLOPS | Progress: (16/20) | 40.25 s
-[Task 24/25]  Current/Best:    3.31/   9.15 GFLOPS | Progress: (20/20) | 46.22 s Done.
+[Task 24/25]  Current/Best:    6.06/   8.88 GFLOPS | Progress: (16/20) | 39.13 s
+[Task 24/25]  Current/Best:    3.34/   8.88 GFLOPS | Progress: (20/20) | 45.02 s Done.
 
 [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 25/25]  Current/Best:    1.55/   2.77 GFLOPS | Progress: (4/20) | 11.59 s
-[Task 25/25]  Current/Best:    5.81/   7.65 GFLOPS | Progress: (8/20) | 22.88 s
-[Task 25/25]  Current/Best:    5.90/   7.65 GFLOPS | Progress: (12/20) | 34.17 s
-[Task 25/25]  Current/Best:    5.84/   8.38 GFLOPS | Progress: (16/20) | 35.99 s
-[Task 25/25]  Current/Best:    2.88/   8.59 GFLOPS | Progress: (20/20) | 46.65 s
+[Task 25/25]  Current/Best:    1.55/   2.87 GFLOPS | Progress: (4/20) | 11.60 s
+[Task 25/25]  Current/Best:    5.73/   7.90 GFLOPS | Progress: (8/20) | 22.90 s
+[Task 25/25]  Current/Best:    5.96/   7.90 GFLOPS | Progress: (12/20) | 34.16 s
+[Task 25/25]  Current/Best:    5.80/   8.57 GFLOPS | Progress: (16/20) | 35.94 s
+[Task 25/25]  Current/Best:    2.86/   9.21 GFLOPS | Progress: (20/20) | 46.61 s
 </pre></div>
 </div>
 <p>The output from this tuning process will look something like this:</p>
@@ -975,8 +975,8 @@ improvement in comparing the optimized model to the unoptimized model.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;unoptimized: </span><span class="si">%s</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">unoptimized</span></a><span class="p">))</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>optimized: {&#39;mean&#39;: 412.61378658001377, &#39;median&#39;: 412.78359264999835, &#39;std&#39;: 0.6393139363011658}
-unoptimized: {&#39;mean&#39;: 496.66298516000097, &#39;median&#39;: 496.44014619998416, &#39;std&#39;: 0.7022675180231087}
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>optimized: {&#39;mean&#39;: 408.9233894200015, &#39;median&#39;: 408.743179150008, &#39;std&#39;: 0.6943345236249036}
+unoptimized: {&#39;mean&#39;: 497.4941768000008, &#39;median&#39;: 497.148796700003, &#39;std&#39;: 1.2604263698723472}
 </pre></div>
 </div>
 </div>
@@ -990,7 +990,7 @@ models.</p>
 <p>Here we presented a simple example using ResNet-50 v2 locally. However, TVM
 supports many more features including cross-compilation, remote execution and
 profiling/benchmarking.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 10 minutes  25.178 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 10 minutes  17.460 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-autotvm-relay-x86-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">autotvm_relay_x86.py</span></code></a></p>
diff --git a/docs/tutorial/cross_compilation_and_rpc.html b/docs/tutorial/cross_compilation_and_rpc.html
index 74fb80dbf..4cd6b5949 100644
--- a/docs/tutorial/cross_compilation_and_rpc.html
+++ b/docs/tutorial/cross_compilation_and_rpc.html
@@ -521,7 +521,7 @@ device and returns the measured cost. Network overhead is excluded.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">%g</span><span class="s2"> secs/op&quot;</span> <span class="o">%</span> <span class="n">cost</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>1.373e-07 secs/op
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>1.31e-07 secs/op
 </pre></div>
 </div>
 </div>
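The secs/op figure above comes from running `time_evaluator` over an RPC session. A minimal sketch with a hypothetical server address, assuming an RPC server was started on the device with `python -m tvm.exec.rpc_server`:

    import numpy as np
    import tvm
    from tvm import te, rpc

    host, port = "10.77.1.162", 9090  # hypothetical RPC server address

    # Build a trivial kernel locally and ship it to the remote device.
    n = 1024
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)
    func = tvm.build(s, [A, B], target="llvm", name="add_one")
    func.export_library("/tmp/add_one.tar")

    remote = rpc.connect(host, port)
    remote.upload("/tmp/add_one.tar")
    f = remote.load_module("add_one.tar")

    dev = remote.cpu()
    a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)
    b = tvm.nd.array(np.zeros(n, dtype=B.dtype), dev)
    time_f = f.time_evaluator(f.entry_name, dev, number=10)
    print("%g secs/op" % time_f(a, b).mean)  # mean cost, network overhead excluded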
diff --git a/docs/tutorial/intro_topi.html b/docs/tutorial/intro_topi.html
index 07a79aa00..8ea2ea6a2 100644
--- a/docs/tutorial/intro_topi.html
+++ b/docs/tutorial/intro_topi.html
@@ -478,7 +478,7 @@ we can schedule the following series of operations ending with <code class="code
 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><a href="../reference/api/python/ir.html#tvm.ir.Array" title="tvm.ir.Array" class="sphx-glr-backref-module-tvm-ir sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">sg</span><span class="o">.</span><span class="n">stages</span></a><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[stage(a, placeholder(a, 0x10ceaad0)), stage(b, placeholder(b, 0x56ccdf0)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[i [...]
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[stage(a, placeholder(a, 0x1176ff40)), stage(b, placeholder(b, 0x229c9870)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[ [...]
 </pre></div>
 </div>
 <p>We can test the correctness by comparing with <code class="code docutils literal notranslate"><span class="pre">numpy</span></code> result as follows</p>
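The stage list printed above comes from scheduling TOPI operators. A minimal CPU sketch (the tutorial itself targets CUDA) that yields a similar `sg.stages` printout:

    import tvm
    from tvm import te, topi

    a = te.placeholder((100, 10, 10), name="a")
    b = te.placeholder((10, 10), name="b")
    c = topi.add(a, b)       # numpy-style broadcast add, no manual te.compute
    d = topi.multiply(a, b)  # numpy-style broadcast multiply

    sg = te.create_schedule([c.op, d.op])
    print(sg.stages)  # the stage list shown above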
diff --git a/docs/tutorial/sg_execution_times.html b/docs/tutorial/sg_execution_times.html
index 58d56513b..a0a7d2dc1 100644
--- a/docs/tutorial/sg_execution_times.html
+++ b/docs/tutorial/sg_execution_times.html
@@ -322,7 +322,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorial-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>13:24.222</strong> total execution time for <strong>tutorial</strong> files:</p>
+<p><strong>13:09.209</strong> total execution time for <strong>tutorial</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -331,50 +331,50 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="autotvm_relay_x86.html#sphx-glr-tutorial-autotvm-relay-x86-py"><span class="std std-ref">Compiling and Optimizing a Model with the Python Interface (AutoTVM)</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_relay_x86.py</span></code>)</p></td>
-<td><p>10:25.178</p></td>
+<td><p>10:17.460</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="auto_scheduler_matmul_x86.html#sphx-glr-tutorial-auto-scheduler-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Auto-scheduling</span></a> (<code class="docutils literal notranslate"><span class="pre">auto_scheduler_matmul_x86.py</span></code>)</p></td>
-<td><p>01:03.740</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="tensor_expr_get_started.html#sphx-glr-tutorial-tensor-expr-get-started-py"><span class="std std-ref">Working with Operators Using Tensor Expression</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_expr_get_started.py</span></code>)</p></td>
+<td><p>01:00.292</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="tensor_expr_get_started.html#sphx-glr-tutorial-tensor-expr-get-started-py"><span class="std std-ref">Working with Operators Using Tensor Expression</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_expr_get_started.py</span></code>)</p></td>
-<td><p>01:02.036</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="auto_scheduler_matmul_x86.html#sphx-glr-tutorial-auto-scheduler-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Auto-scheduling</span></a> (<code class="docutils literal notranslate"><span class="pre">auto_scheduler_matmul_x86.py</span></code>)</p></td>
+<td><p>00:58.137</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="relay_quick_start.html#sphx-glr-tutorial-relay-quick-start-py"><span class="std std-ref">Quick Start Tutorial for Compiling Deep Learning Models</span></a> (<code class="docutils literal notranslate"><span class="pre">relay_quick_start.py</span></code>)</p></td>
-<td><p>00:27.991</p></td>
+<td><p>00:27.733</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="autotvm_matmul_x86.html#sphx-glr-tutorial-autotvm-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Schedule Templates and AutoTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_matmul_x86.py</span></code>)</p></td>
-<td><p>00:23.609</p></td>
+<td><p>00:23.625</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tensor_ir_blitz_course.html#sphx-glr-tutorial-tensor-ir-blitz-course-py"><span class="std std-ref">Blitz Course to TensorIR</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_ir_blitz_course.py</span></code>)</p></td>
-<td><p>00:00.810</p></td>
+<td><p>00:01.132</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="intro_topi.html#sphx-glr-tutorial-intro-topi-py"><span class="std std-ref">Introduction to TOPI</span></a> (<code class="docutils literal notranslate"><span class="pre">intro_topi.py</span></code>)</p></td>
-<td><p>00:00.693</p></td>
+<td><p>00:00.685</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="cross_compilation_and_rpc.html#sphx-glr-tutorial-cross-compilation-and-rpc-py"><span class="std std-ref">Cross Compilation and RPC</span></a> (<code class="docutils literal notranslate"><span class="pre">cross_compilation_and_rpc.py</span></code>)</p></td>
-<td><p>00:00.157</p></td>
+<td><p>00:00.139</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="introduction.html#sphx-glr-tutorial-introduction-py"><span class="std std-ref">Introduction</span></a> (<code class="docutils literal notranslate"><span class="pre">introduction.py</span></code>)</p></td>
-<td><p>00:00.004</p></td>
+<td><p>00:00.005</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tvmc_python.html#sphx-glr-tutorial-tvmc-python-py"><span class="std std-ref">Getting Starting using TVMC Python: a high-level API for TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_python.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="tvmc_command_line_driver.html#sphx-glr-tutorial-tvmc-command-line-driver-py"><span class="std std-ref">Compiling and Optimizing a Model with TVMC</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_command_line_driver.py</span></code>)</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="install.html#sphx-glr-tutorial-install-py"><span class="std std-ref">Installing TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">install.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="install.html#sphx-glr-tutorial-install-py"><span class="std std-ref">Installing TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">install.py</span></code>)</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="tvmc_command_line_driver.html#sphx-glr-tutorial-tvmc-command-line-driver-py"><span class="std std-ref">Compiling and Optimizing a Model with TVMC</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_command_line_driver.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
diff --git a/docs/tutorial/tensor_expr_get_started.html b/docs/tutorial/tensor_expr_get_started.html
index dfdfbee43..70366d1a2 100644
--- a/docs/tutorial/tensor_expr_get_started.html
+++ b/docs/tutorial/tensor_expr_get_started.html
@@ -537,7 +537,7 @@ helper function to run a profile of the TVM generated code.</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.000008
-naive: 0.000008
+naive: 0.000006
 </pre></div>
 </div>
 </div>
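The numpy/naive numbers above come from profiling a simple vector-add kernel. A sketch of the kernel and the `time_evaluator`-based helper, assuming the tutorial's default schedule (the vector length here is a hypothetical choice; any large size works):

    import numpy as np
    import tvm
    from tvm import te

    n = 32768
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

    s = te.create_schedule(C.op)  # naive schedule: one sequential loop
    fadd = tvm.build(s, [A, B, C], "llvm", name="myadd")

    dev = tvm.cpu(0)
    a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev)
    b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev)
    c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev)

    evaluator = fadd.time_evaluator(fadd.entry_name, dev, number=10)
    print("naive: %f" % evaluator(a, b, c).mean)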
@@ -588,7 +588,7 @@ compile and run this new schedule with the parallel operation applied:</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-parallel: 0.000008
+parallel: 0.000006
 </pre></div>
 </div>
 </div>
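The parallel figure above needs only one extra scheduling directive on the same computation. A sketch, reusing `A`, `B`, and `C` from the previous step:

    s = te.create_schedule(C.op)
    s[C].parallel(C.op.axis[0])  # distribute the loop across CPU threads
    fadd_parallel = tvm.build(s, [A, B, C], "llvm", name="myadd_parallel")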
@@ -662,10 +662,10 @@ vector: 0.000025
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Operator                  Timing             Performance
-   numpy    7.98896000105742e-06                     1.0
-   naive               7.532e-06      0.9428010653455597
-parallel              7.9626e-06      0.9967004464844071
-  vector    2.4634800000000003e-05    3.0836053750099306
+   numpy    8.199280000553699e-06                    1.0
+   naive              5.8584e-06      0.7145017610819951
+parallel    6.0600000000000004e-06    0.7390892858386063
+  vector             2.46396e-05       3.005093129925565
 </pre></div>
 </div>
 <div class="admonition-code-specialization admonition">
@@ -981,7 +981,7 @@ matrix multiplication.</p>
 <span class="n">answer</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">b</span><span class="o">.</span><span class="n">numpy</span><span class="p">())</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018837
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.017751
 </pre></div>
 </div>
 <p>Now we write a basic matrix multiplication using TVM TE and verify that it
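The baseline timed in the hunks below is the tutorial's plain TE matrix multiplication. A sketch of its definition:

    import tvm
    from tvm import te

    M = K = N = 1024
    dtype = "float32"

    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A", dtype=dtype)
    B = te.placeholder((K, N), name="B", dtype=dtype)
    C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

    s = te.create_schedule(C.op)  # default schedule: three nested loops, no optimization
    func = tvm.build(s, [A, B, C], target="llvm", name="mmult")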
@@ -1024,7 +1024,7 @@ optimizations.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-none: 3.508006
+none: 3.416307
 </pre></div>
 </div>
 <p>Let’s take a look at the intermediate representation of the operator and
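The IR dumps interleaved with the timings below can be reproduced at any point with `tvm.lower`, using the `s`, `A`, `B`, `C` from the sketch above:

    # Print the TensorIR for the current schedule without building it.
    print(tvm.lower(s, [A, B, C], simple_mode=True))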
@@ -1091,7 +1091,7 @@ schedule.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-blocking: 0.293005
+blocking: 0.279631
 </pre></div>
 </div>
 <p>By reordering the computation to take advantage of caching, you should see a
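The blocking number above corresponds to tiling the loop nest so each block of C stays in cache. A sketch of the schedule, assuming the matmul definition above:

    bn = 32  # block size
    s = te.create_schedule(C.op)

    # Tile the two spatial loops into 32x32 blocks.
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)

    # Hoist the block loops outward so each block finishes before the next starts.
    s[C].reorder(xo, yo, ko, ki, xi, yi)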
@@ -1152,7 +1152,7 @@ already cache friendly from our previous optimizations.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-vectorization: 0.328433
+vectorization: 0.318937
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
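The vectorization timing above adds a single directive on top of the blocked schedule: the innermost axis `yi` iterates contiguous elements of C, so it can be mapped to SIMD lanes:

    s[C].vectorize(yi)  # unit-stride inner axis, safe to vectorize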
@@ -1209,7 +1209,7 @@ more cache friendly.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-loop permutation: 0.119415
+loop permutation: 0.116747
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
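Loop permutation hoists `xi` above `ki` so the innermost body writes C sequentially instead of revisiting it across the reduction. A sketch of the full schedule, reusing `bn` from the blocking step:

    s = te.create_schedule(C.op)
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=4)

    s[C].reorder(xo, yo, ko, xi, ki, yi)  # xi moved outside ki
    s[C].vectorize(yi)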
@@ -1287,7 +1287,7 @@ optimized schedule.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-array packing: 0.110499
+array packing: 0.108995
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
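Array packing rearranges B into an [N/bn][K][bn] layout so the inner loop reads it with unit stride. A sketch of the packed computation, assuming the same shapes:

    k = te.reduce_axis((0, K), "k")
    packedB = te.compute(
        (N // bn, K, bn), lambda bigN, kk, littleN: B[kk, bigN * bn + littleN], name="packedB"
    )
    C = te.compute(
        (M, N),
        lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)  # then tile/reorder/vectorize as before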
@@ -1363,7 +1363,7 @@ to `C</cite> when all the block results are ready.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-block caching: 0.111033
+block caching: 0.110385
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
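The block-caching step accumulates each block's results in a small write cache and flushes to C once the block is done. A sketch using `cache_write` on the packed schedule:

    s = te.create_schedule(C.op)
    CC = s.cache_write(C, "global")  # local accumulation buffer for one block

    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[CC].compute_at(s[C], yo)  # compute the cache block inside the block loops

    mc, nc = s[CC].op.axis
    (kaxis,) = s[CC].op.reduce_axis
    ko, ki = s[CC].split(kaxis, factor=4)
    s[CC].reorder(ko, mc, ki, nc)
    s[CC].vectorize(nc)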
@@ -1432,7 +1432,7 @@ of thread-level parallelization.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-parallelization: 0.145281
+parallelization: 0.142919
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
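The final parallelization figure needs only one more directive on the outermost block loop `xo` from the cached schedule above:

    s[C].parallel(xo)  # spread the outer blocks across CPU threads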
@@ -1494,13 +1494,13 @@ working, we can compare the results.</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>        Operator                  Timing             Performance
-            none            3.5080056702                     1.0
-        blocking            0.2930054273     0.08352478725705567
-   vectorization            0.3284332889     0.09362393330489549
-loop permutation     0.11941494689999999    0.034040693809138524
-   array packing            0.1104993643    0.031499197746080086
-   block caching     0.11103318059999998    0.031651368623263856
- parallelization            0.1452810395    0.041414140442856566
+            none             3.416307389                     1.0
+        blocking            0.2796306091     0.08185171217331579
+   vectorization            0.3189368985      0.0933572018510188
+loop permutation     0.11674748589999999      0.0341735893777912
+   array packing     0.10899485709999998     0.03190428866294267
+   block caching     0.11038474670000001     0.03231112840004457
+ parallelization     0.14291868489999998    0.041834258053059514
 </pre></div>
 </div>
 <p>Note that the outputs on the web page reflect the running times on a
@@ -1532,7 +1532,7 @@ is</p>
 you can build generic templates of the matrix multiplication and other
 operations with tunable parameters that allow you to automatically optimize
 the computation for specific platforms.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  2.036 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  0.292 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-tensor-expr-get-started-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/40a01cffb015a67aaec0fad7e27cf80d/tensor_expr_get_started.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tensor_expr_get_started.py</span></code></a></p>