Posted to commits@tvm.apache.org by tq...@apache.org on 2022/07/28 21:34:12 UTC
[tvm-site] branch asf-site updated: deploying docs (apache/tvm@ebbce649f08dca6a870e7845febc39125b240001)

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 45fcdd4c0 deploying docs (apache/tvm@ebbce649f08dca6a870e7845febc39125b240001)
45fcdd4c0 is described below

commit 45fcdd4c0e0ede6e74d3525127f5e0bb4430d677
Author: tvm-bot <95...@users.noreply.github.com>
AuthorDate: Thu Jul 28 21:34:06 2022 +0000

    deploying docs (apache/tvm@ebbce649f08dca6a870e7845febc39125b240001)
---
 .../micro_aot.ipynb                                | 144 +++++
 .../f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py  | 180 ++++++
 docs/_images/sphx_glr_micro_aot_thumb.png          | Bin 0 -> 26794 bytes
 .../how_to/compile_models/from_darknet.rst.txt     |   2 +-
 .../how_to/compile_models/from_mxnet.rst.txt       |   2 +-
 .../how_to/compile_models/from_oneflow.rst.txt     |   2 +-
 .../how_to/compile_models/from_pytorch.rst.txt     |   2 +-
 .../how_to/compile_models/from_tensorflow.rst.txt  |   2 +-
 .../compile_models/sg_execution_times.rst.txt      |  22 +-
 .../deploy_models/deploy_model_on_android.rst.txt  |   2 +-
 .../deploy_object_detection_pytorch.rst.txt        |   4 +-
 .../deploy_models/deploy_prequantized.rst.txt      |   6 +-
 .../deploy_prequantized_tflite.rst.txt             |   4 +-
 .../how_to/deploy_models/deploy_quantized.rst.txt  |   2 +-
 .../deploy_models/deploy_ssd_gluoncv.rst.txt       |   4 +-
 .../deploy_models/sg_execution_times.rst.txt       |  16 +-
 .../extend_tvm/bring_your_own_datatypes.rst.txt    |   4 +-
 .../how_to/extend_tvm/sg_execution_times.rst.txt   |   8 +-
 .../how_to/extend_tvm/use_pass_instrument.rst.txt  |  16 +-
 .../optimize_operators/opt_conv_cuda.rst.txt       |   2 +-
 .../optimize_operators/opt_conv_tensorcore.rst.txt |   2 +-
 .../how_to/optimize_operators/opt_gemm.rst.txt     |  16 +-
 .../optimize_operators/sg_execution_times.rst.txt  |   8 +-
 .../sg_execution_times.rst.txt                     |  14 +-
 .../tune_conv2d_layer_cuda.rst.txt                 | 438 +++++++++++----
 .../tune_network_cuda.rst.txt                      |   2 +-
 .../tune_network_x86.rst.txt                       |   4 +-
 .../tune_sparse_x86.rst.txt                        |  82 ++-
 .../tune_with_autotvm/sg_execution_times.rst.txt   |   6 +-
 .../tune_with_autotvm/tune_conv2d_cuda.rst.txt     |  26 +-
 .../how_to/work_with_microtvm/index.rst.txt        |  18 +
 .../how_to/work_with_microtvm/micro_aot.rst.txt    | 289 ++++++++++
 .../work_with_microtvm/micro_autotune.rst.txt      |  16 +-
 .../how_to/work_with_microtvm/micro_train.rst.txt  |  16 +-
 .../work_with_microtvm/sg_execution_times.rst.txt  |  10 +-
 .../work_with_relay/sg_execution_times.rst.txt     |   8 +-
 .../how_to/work_with_schedules/intrin_math.rst.txt |   2 +-
 .../work_with_schedules/sg_execution_times.rst.txt |  18 +-
 .../how_to/work_with_schedules/tensorize.rst.txt   |   2 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |   4 +-
 .../frontend/deploy_classification.rst.txt         |   2 +-
 .../tutorials/frontend/deploy_detection.rst.txt    |   2 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |   6 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |   6 +-
 .../topic/vta/tutorials/sg_execution_times.rst.txt |   6 +-
 .../tutorial/auto_scheduler_matmul_x86.rst.txt     |   6 +-
 docs/_sources/tutorial/autotvm_matmul_x86.rst.txt  |  20 +-
 docs/_sources/tutorial/autotvm_relay_x86.rst.txt   |  54 +-
 .../tutorial/cross_compilation_and_rpc.rst.txt     |   2 +-
 docs/_sources/tutorial/intro_topi.rst.txt          |   2 +-
 docs/_sources/tutorial/sg_execution_times.rst.txt  |  22 +-
 .../tutorial/tensor_expr_get_started.rst.txt       |  44 +-
 docs/commit_hash                                   |   2 +-
 docs/how_to/compile_models/from_darknet.html       |   2 +-
 docs/how_to/compile_models/from_mxnet.html         |   2 +-
 docs/how_to/compile_models/from_oneflow.html       |  15 +-
 docs/how_to/compile_models/from_pytorch.html       |   6 +-
 docs/how_to/compile_models/from_tensorflow.html    |   2 +-
 docs/how_to/compile_models/sg_execution_times.html |  30 +-
 .../deploy_models/deploy_model_on_android.html     |   2 +-
 .../deploy_object_detection_pytorch.html           |  45 +-
 docs/how_to/deploy_models/deploy_prequantized.html |   8 +-
 .../deploy_models/deploy_prequantized_tflite.html  |   4 +-
 docs/how_to/deploy_models/deploy_quantized.html    |   2 +-
 docs/how_to/deploy_models/deploy_ssd_gluoncv.html  |  36 +-
 docs/how_to/deploy_models/sg_execution_times.html  |  16 +-
 .../extend_tvm/bring_your_own_datatypes.html       |   4 +-
 docs/how_to/extend_tvm/sg_execution_times.html     |   8 +-
 docs/how_to/extend_tvm/use_pass_instrument.html    |  16 +-
 docs/how_to/optimize_operators/opt_conv_cuda.html  |   2 +-
 .../optimize_operators/opt_conv_tensorcore.html    |   2 +-
 docs/how_to/optimize_operators/opt_gemm.html       |  16 +-
 .../optimize_operators/sg_execution_times.html     |   8 +-
 .../sg_execution_times.html                        |  14 +-
 .../tune_conv2d_layer_cuda.html                    | 438 +++++++++++----
 .../tune_with_autoscheduler/tune_network_cuda.html |   2 +-
 .../tune_with_autoscheduler/tune_network_x86.html  |   4 +-
 .../tune_with_autoscheduler/tune_sparse_x86.html   |  82 ++-
 .../tune_with_autotvm/sg_execution_times.html      |   6 +-
 .../how_to/tune_with_autotvm/tune_conv2d_cuda.html |  26 +-
 docs/how_to/work_with_microtvm/index.html          |  10 +-
 docs/how_to/work_with_microtvm/micro_aot.html      | 605 +++++++++++++++++++++
 docs/how_to/work_with_microtvm/micro_autotune.html |  21 +-
 docs/how_to/work_with_microtvm/micro_ethosu.html   |   1 +
 .../work_with_microtvm/micro_reference_vm.html     |   1 +
 docs/how_to/work_with_microtvm/micro_tflite.html   |   1 +
 docs/how_to/work_with_microtvm/micro_train.html    |  17 +-
 docs/how_to/work_with_microtvm/micro_tvmc.html     |   1 +
 .../work_with_microtvm/sg_execution_times.html     |  20 +-
 .../how_to/work_with_relay/sg_execution_times.html |   8 +-
 docs/how_to/work_with_schedules/intrin_math.html   |   2 +-
 .../work_with_schedules/sg_execution_times.html    |  18 +-
 docs/how_to/work_with_schedules/tensorize.html     |   2 +-
 docs/objects.inv                                   | Bin 22693 -> 22854 bytes
 docs/reference/api/python/auto_scheduler.html      |   4 +-
 .../api/typedoc/classes/bytestreamreader.html      |  12 +-
 .../api/typedoc/classes/cachedcallstack.html       |  34 +-
 docs/reference/api/typedoc/classes/dldatatype.html |  12 +-
 docs/reference/api/typedoc/classes/dldevice.html   |  10 +-
 .../reference/api/typedoc/classes/environment.html |  12 +-
 docs/reference/api/typedoc/classes/ffilibrary.html |  20 +-
 .../api/typedoc/classes/graphexecutor.html         |  16 +-
 docs/reference/api/typedoc/classes/instance.html   |  40 +-
 docs/reference/api/typedoc/classes/memory.html     |  34 +-
 docs/reference/api/typedoc/classes/module.html     |  10 +-
 docs/reference/api/typedoc/classes/ndarray.html    |  22 +-
 .../api/typedoc/classes/packedfunccell.html        |   6 +-
 docs/reference/api/typedoc/classes/rpcserver.html  |  14 +-
 docs/reference/api/typedoc/classes/scalar.html     |   6 +-
 .../api/typedoc/classes/webgpucontext.html         |  12 +-
 docs/reference/api/typedoc/enums/argtypecode.html  |  30 +-
 .../api/typedoc/enums/aynccallbackcode.html        |   4 +-
 .../api/typedoc/enums/dldatatypecode.html          |   8 +-
 .../api/typedoc/enums/rpcserverstate.html          |  12 +-
 docs/reference/api/typedoc/enums/sizeof.html       |  18 +-
 docs/reference/api/typedoc/index.html              | 112 ++--
 .../api/typedoc/interfaces/disposable.html         |   2 +-
 .../api/typedoc/interfaces/functioninfo.html       |   6 +-
 .../api/typedoc/interfaces/libraryprovider.html    |   4 +-
 docs/searchindex.js                                |   2 +-
 .../vta/tutorials/autotvm/sg_execution_times.html  |   4 +-
 .../tutorials/frontend/deploy_classification.html  |   2 +-
 .../vta/tutorials/frontend/deploy_detection.html   |   2 +-
 .../vta/tutorials/frontend/sg_execution_times.html |   6 +-
 .../vta/tutorials/optimize/sg_execution_times.html |   6 +-
 docs/topic/vta/tutorials/sg_execution_times.html   |   6 +-
 docs/tutorial/auto_scheduler_matmul_x86.html       |   5 +-
 docs/tutorial/autotvm_matmul_x86.html              |  20 +-
 docs/tutorial/autotvm_relay_x86.html               | 258 ++++-----
 docs/tutorial/cross_compilation_and_rpc.html       |   2 +-
 docs/tutorial/intro_topi.html                      |   2 +-
 docs/tutorial/sg_execution_times.html              |  24 +-
 docs/tutorial/tensor_expr_get_started.html         |  44 +-
 133 files changed, 2879 insertions(+), 1048 deletions(-)

diff --git a/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb b/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb
new file mode 100644
index 000000000..c03d3cb82
--- /dev/null
+++ b/docs/_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb
@@ -0,0 +1,144 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%matplotlib inline"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "\n\n# microTVM Host-Driven AoT\n**Authors**:\n[Mehrdad Hessar](https://github.com/mehrdadh),\n[Alan MacDonald](https://github.com/alanmacd)\n\nThis tutorial showcases microTVM host-driven AoT compilation with\na TFLite model. AoTExecutor reduces the overhead of parsing the graph at runtime\ncompared to GraphExecutor. Ahead-of-time compilation also allows better memory\nmanagement. This tutorial can be executed on an x86 CPU using the C runtime (CRT)\nor on the Zephyr platfo [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\nimport pathlib\nimport json\nimport os\n\nimport tvm\nfrom tvm import relay\nfrom tvm.relay.backend import Executor, Runtime\nfrom tvm.contrib.download import download_testdata"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Import a TFLite model\n\nTo begin with, download and import a Keyword Spotting TFLite model.\nThis model is originally from [MLPerf Tiny repository](https://github.com/mlcommons/tiny).\nTo test this model, we use samples from [KWS dataset provided by Google](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html).\n\n**Note:** By default this tutorial runs on an x86 CPU using CRT. To run on the Zephyr platform\ninstead, export the `TVM_MICRO_USE_HW [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "use_physical_hw = bool(os.getenv(\"TVM_MICRO_USE_HW\"))\nMODEL_URL = \"https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite\"\nMODEL_PATH = download_testdata(MODEL_URL, \"keyword_spotting_quant.tflite\", module=\"model\")\nSAMPLE_URL = \"https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy\"\nSAMPLE_PATH = download_testdata(SAMPLE_URL, \"keyword_spotting_int8_6.pyc.npy\", module=\"data [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Defining the target\n\nNow we need to define the target, runtime and executor. This tutorial focuses on\nthe AOT host-driven executor. We use the host micro target, which runs a model\non an x86 CPU using the CRT runtime or with the Zephyr platform on the qemu_x86 simulator\nboard. For a physical microcontroller, we look up the target model for the physical\nboard (e.g. nucleo_l4r5zi) and pass it to `tvm.target.target.micro` to create a full\nmicro t [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "# Use the C runtime (crt) and enable static linking by setting system-lib to True\nRUNTIME = Runtime(\"crt\", {\"system-lib\": True})\n\n# Simulate a microcontroller on the host machine. Uses the main() from `src/runtime/crt/host/main.cc <https://github.com/apache/tvm/blob/main/src/runtime/crt/host/main.cc>`_.\n# To use physical hardware, replace \"host\" with something matching your hardware.\nTARGET = tvm.target.target.micro(\"host\")\n\n# Use the AOT executor rather than grap [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Compile the model\n\nNow, we compile the model for the target:\n\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "with tvm.transform.PassContext(opt_level=3, config={\"tir.disable_vectorize\": True}):\n    module = tvm.relay.build(\n        relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR\n    )"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Create a microTVM project\n\nNow that we have the compiled model as an IRModule, we need to create a firmware project\nto use the compiled model with microTVM. To do this, we use the Project API. We have defined\nCRT and Zephyr microTVM template projects, which are used for x86 CPU and Zephyr boards\nrespectively.\n\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects(\"crt\"))\nproject_options = {}  # You can use options to provide platform-specific options through TVM.\n\nif use_physical_hw:\n    template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects(\"zephyr\"))\n    project_options = {\"project_type\": \"host_driven\", \"zephyr_board\": BOARD}\n\ntemp_dir = tvm.contrib.utils.tempdir()\ngenerated_project_dir = temp_dir / \"project\"\nprojec [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Build, flash and execute the model\nNext, we build the microTVM project and flash it. The flash step is specific to\nphysical microcontrollers; it is skipped when simulating a microcontroller\nvia the host main.cc or when a Zephyr emulated board is selected as the target.\nWe then define the labels for the model output and execute the model with a\nsample whose expected value is 6 (label: left).\n\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "project.build()\nproject.flash()\n\nlabels = [\n    \"_silence_\",\n    \"_unknown_\",\n    \"yes\",\n    \"no\",\n    \"up\",\n    \"down\",\n    \"left\",\n    \"right\",\n    \"on\",\n    \"off\",\n    \"stop\",\n    \"go\",\n]\nwith tvm.micro.Session(project.transport()) as session:\n    aot_executor = tvm.runtime.executor.aot_executor.AotModule(session.create_aot_executor())\n    sample = np.load(SAMPLE_PATH)\n    aot_executor.get_input(INPUT_NAME).copyfrom(sample)\n    aot [...]
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.7.5"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file
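Editorial aside on the `TVM_MICRO_USE_HW` switch described in the notebook above: the tutorial reads it with `bool(os.getenv("TVM_MICRO_USE_HW"))`, so any non-empty value, including "0", selects the hardware path. The sketch below uses only the standard library to show that behavior next to a stricter alternative. The helper names `env_flag_truthy` and `env_flag_strict`, and the set of accepted spellings, are hypothetical and are not part of TVM or of this commit.

```python
import os


def env_flag_truthy(value):
    """Mirror bool(os.getenv(NAME)): True for any non-empty string."""
    return bool(value)


def env_flag_strict(value):
    """Hypothetical stricter parse: only common 'on' spellings enable the flag."""
    if value is None:
        return False
    return value.strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    raw = os.getenv("TVM_MICRO_USE_HW")
    # With TVM_MICRO_USE_HW=0 the first form is True, the second False.
    print("truthy:", env_flag_truthy(raw), "strict:", env_flag_strict(raw))
```

With the tutorial's form, leaving the variable unset (or setting it to an empty string) is the only way to stay on the x86 CRT path.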
diff --git a/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py b/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py
new file mode 100644
index 000000000..9a177559e
--- /dev/null
+++ b/docs/_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py
@@ -0,0 +1,180 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+.. _tutorial-micro-AoT:
+
+microTVM Host-Driven AoT
+===========================
+**Authors**:
+`Mehrdad Hessar <https://github.com/mehrdadh>`_,
+`Alan MacDonald <https://github.com/alanmacd>`_
+
+This tutorial showcases microTVM host-driven AoT compilation with
+a TFLite model. AoTExecutor reduces the overhead of parsing the graph at runtime
+compared to GraphExecutor. Ahead-of-time compilation also allows better memory
+management. This tutorial can be executed on an x86 CPU using the C runtime (CRT)
+or on the Zephyr platform on a microcontroller/board supported by Zephyr.
+"""
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
+import numpy as np
+import pathlib
+import json
+import os
+
+import tvm
+from tvm import relay
+from tvm.relay.backend import Executor, Runtime
+from tvm.contrib.download import download_testdata
+
+######################################################################
+# Import a TFLite model
+# ---------------------
+#
+# To begin with, download and import a Keyword Spotting TFLite model.
+# This model is originally from `MLPerf Tiny repository <https://github.com/mlcommons/tiny>`_.
+# To test this model, we use samples from `KWS dataset provided by Google <https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html>`_.
+#
+# **Note:** By default this tutorial runs on an x86 CPU using CRT. To run on the Zephyr platform
+# instead, export the `TVM_MICRO_USE_HW` environment variable.
+#
+use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
+MODEL_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite"
+MODEL_PATH = download_testdata(MODEL_URL, "keyword_spotting_quant.tflite", module="model")
+SAMPLE_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy"
+SAMPLE_PATH = download_testdata(SAMPLE_URL, "keyword_spotting_int8_6.pyc.npy", module="data")
+
+tflite_model_buf = open(MODEL_PATH, "rb").read()
+try:
+    import tflite
+
+    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)
+except AttributeError:
+    import tflite.Model
+
+    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
+
+input_shape = (1, 49, 10, 1)
+INPUT_NAME = "input_1"
+relay_mod, params = relay.frontend.from_tflite(
+    tflite_model, shape_dict={INPUT_NAME: input_shape}, dtype_dict={INPUT_NAME: "int8"}
+)
+
+######################################################################
+# Defining the target
+# -------------------
+#
+# Now we need to define the target, runtime and executor. This tutorial focuses on
+# the AOT host-driven executor. We use the host micro target, which runs a model
+# on an x86 CPU using the CRT runtime or with the Zephyr platform on the qemu_x86 simulator
+# board. For a physical microcontroller, we look up the target model for the physical
+# board (e.g. nucleo_l4r5zi) and pass it to `tvm.target.target.micro` to create a full
+# micro target.
+#
+
+# Use the C runtime (crt) and enable static linking by setting system-lib to True
+RUNTIME = Runtime("crt", {"system-lib": True})
+
+# Simulate a microcontroller on the host machine. Uses the main() from `src/runtime/crt/host/main.cc <https://github.com/apache/tvm/blob/main/src/runtime/crt/host/main.cc>`_.
+# To use physical hardware, replace "host" with something matching your hardware.
+TARGET = tvm.target.target.micro("host")
+
+# Use the AOT executor rather than graph or vm executors. Don't use unpacked API or C calling style.
+EXECUTOR = Executor("aot")
+
+if use_physical_hw:
+    boards_file = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr")) / "boards.json"
+    with open(boards_file) as f:
+        boards = json.load(f)
+    BOARD = os.getenv("TVM_MICRO_BOARD", default="nucleo_l4r5zi")
+    TARGET = tvm.target.target.micro(boards[BOARD]["model"])
+
+######################################################################
+# Compile the model
+# -----------------
+#
+# Now, we compile the model for the target:
+#
+with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
+    module = tvm.relay.build(
+        relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
+    )
+
+######################################################################
+# Create a microTVM project
+# -------------------------
+#
+# Now that we have the compiled model as an IRModule, we need to create a firmware project
+# to use the compiled model with microTVM. To do this, we use the Project API. We have defined
+# CRT and Zephyr microTVM template projects, which are used for x86 CPU and Zephyr boards
+# respectively.
+#
+template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("crt"))
+project_options = {}  # You can use options to provide platform-specific options through TVM.
+
+if use_physical_hw:
+    template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr"))
+    project_options = {"project_type": "host_driven", "zephyr_board": BOARD}
+
+temp_dir = tvm.contrib.utils.tempdir()
+generated_project_dir = temp_dir / "project"
+project = tvm.micro.generate_project(
+    template_project_path, module, generated_project_dir, project_options
+)
+
+######################################################################
+# Build, flash and execute the model
+# ----------------------------------
+# Next, we build the microTVM project and flash it. The flash step is specific to
+# physical microcontrollers; it is skipped when simulating a microcontroller
+# via the host main.cc or when a Zephyr emulated board is selected as the target.
+# We then define the labels for the model output and execute the model with a
+# sample whose expected value is 6 (label: left).
+#
+project.build()
+project.flash()
+
+labels = [
+    "_silence_",
+    "_unknown_",
+    "yes",
+    "no",
+    "up",
+    "down",
+    "left",
+    "right",
+    "on",
+    "off",
+    "stop",
+    "go",
+]
+with tvm.micro.Session(project.transport()) as session:
+    aot_executor = tvm.runtime.executor.aot_executor.AotModule(session.create_aot_executor())
+    sample = np.load(SAMPLE_PATH)
+    aot_executor.get_input(INPUT_NAME).copyfrom(sample)
+    aot_executor.run()
+    result = aot_executor.get_output(0).numpy()
+    print(f"Label is `{labels[np.argmax(result)]}` with index `{np.argmax(result)}`")
+#
+# Output:
+# Label is `left` with index `6`
+#
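Editorial aside on the final step of the script above, which maps the model output to a keyword with `labels[np.argmax(result)]`. The sketch below reproduces that decoding in plain Python so it can be checked without TVM or NumPy. The helper name `decode_label` and the score vector are made up for illustration; the scores are not real model output, and the sketch assumes a flat sequence of 12 scores rather than the array the executor returns.

```python
# Label order copied from the tutorial script.
LABELS = [
    "_silence_", "_unknown_", "yes", "no", "up", "down",
    "left", "right", "on", "off", "stop", "go",
]


def decode_label(scores, labels=LABELS):
    """Return (index, label) of the highest score, like labels[np.argmax(scores)]."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    # Index of the maximum; ties resolve to the first occurrence, as np.argmax does.
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, labels[index]


if __name__ == "__main__":
    # Hypothetical int8-style scores with the peak at index 6.
    fake_scores = [-128, -120, -90, -100, -110, -95, 120, -60, -128, -128, -128, -128]
    index, label = decode_label(fake_scores)
    print(f"Label is `{label}` with index `{index}`")
```

The length check is an addition for the sketch; the tutorial itself indexes `labels` directly.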
diff --git a/docs/_images/sphx_glr_micro_aot_thumb.png b/docs/_images/sphx_glr_micro_aot_thumb.png
new file mode 100644
index 000000000..8a5fed589
Binary files /dev/null and b/docs/_images/sphx_glr_micro_aot_thumb.png differ
diff --git a/docs/_sources/how_to/compile_models/from_darknet.rst.txt b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
index 5861f2d3f..3882cb05b 100644
--- a/docs/_sources/how_to/compile_models/from_darknet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
@@ -317,7 +317,7 @@ The process is no different from other examples.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  2.779 seconds)
+   **Total running time of the script:** ( 1 minutes  4.745 seconds)
 
 
 .. _sphx_glr_download_how_to_compile_models_from_darknet.py:
diff --git a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
index 61f1fca71..e421b1553 100644
--- a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
@@ -115,7 +115,7 @@ In this section, we download a pretrained imagenet model and classify an image.
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipfcddb97c-cd25-48a0-9e13-dae7422c5359 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip2c0a41eb-8ea5-4907-8fef-fd0b4319dbfd from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
     x (1, 3, 224, 224)
 
 
diff --git a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
index 95d9c1461..cc33bb202 100644
--- a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
@@ -113,7 +113,7 @@ Load a pretrained OneFlow model and save model
  .. code-block:: none
 
     Downloading: "https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip" to /workspace/.oneflow/flowvision_cache/resnet18.zip
-
      0%|          | 0.00/41.5M [00:00<?, ?B/s]
     15%|#5        | 6.33M/41.5M [00:00<00:00, 40.9MB/s]
     25%|##4       | 10.2M/41.5M [00:00<00:01, 28.7MB/s]
     39%|###8      | 16.0M/41.5M [00:00<00:01, 26.5MB/s]
     58%|#####7    | 24.0M/41.5M [00:00<00:00, 30.8MB/s]
     77%|#######7  | 32.0M/41.5M [00:00<00:00, 41.4MB/s]
     96%|#########6| 40.0M/41.5M [00:01<00:00, 46.5MB/s]
    100%|##########| 41.5M/41.5M [00:01<00:00, 40.3MB/s]
+
      0%|          | 0.00/41.5M [00:00<?, ?B/s]
     15%|#5        | 6.33M/41.5M [00:00<00:00, 41.9MB/s]
     25%|##4       | 10.3M/41.5M [00:00<00:01, 32.1MB/s]
     35%|###4      | 14.3M/41.5M [00:00<00:01, 26.2MB/s]
     42%|####2     | 17.5M/41.5M [00:00<00:00, 27.9MB/s]
     58%|#####7    | 24.0M/41.5M [00:00<00:00, 32.5MB/s]
     77%|#######7  | 32.0M/41.5M [00:00<00:00, 41.0MB/s]
     92%|#########2| 38.3M/41.5M [00:01<00:00, 42.7MB/s]
    100%|##########| 41.5M/41.5M [00:01<00:00, 37.6MB/s]
 
 
 
diff --git a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
index 07598b48d..3d6f7580b 100644
--- a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
@@ -94,7 +94,7 @@ Load a pretrained PyTorch model
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
-
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     37%|###7      | 16.6M/44.7M [00:00<00:00, 174MB/s]
     74%|#######4  | 33.3M/44.7M [00:00<00:00, 174MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 187MB/s]
+
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     44%|####4     | 19.7M/44.7M [00:00<00:00, 206MB/s]
     96%|#########6| 43.1M/44.7M [00:00<00:00, 229MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 227MB/s]
 
 
 
diff --git a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
index 3a12212e6..1dda0fc65 100644
--- a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
@@ -423,7 +423,7 @@ Run the corresponding model on tensorflow
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  4.223 seconds)
+   **Total running time of the script:** ( 1 minutes  4.067 seconds)
 
 
 .. _sphx_glr_download_how_to_compile_models_from_tensorflow.py:
diff --git a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
index 36b624c01..4cced9405 100644
--- a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
@@ -5,26 +5,26 @@
 
 Computation times
 =================
-**04:58.294** total execution time for **how_to_compile_models** files:
+**05:10.189** total execution time for **how_to_compile_models** files:
 
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 01:04.223 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 01:04.745 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 01:02.779 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 01:04.067 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:37.794 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:40.949 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:26.951 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:29.185 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:25.243 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:25.321 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:24.041 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:24.691 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:21.237 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:22.786 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:19.060 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:20.700 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:14.560 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:15.335 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.406 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.409 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
index c9c13b589..7460f48db 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
@@ -441,7 +441,7 @@ Execute on TVM
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      15.4404      15.4349      15.4982      15.3954       0.0273   
+      16.5195      16.4907      17.1171      15.9516       0.4699   
                
 
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
index a707c4c06..2f695b98c 100644
--- a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
@@ -123,7 +123,7 @@ Load pre-trained maskrcnn from torchvision and do tracing
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
-
      0%|          | 0.00/170M [00:00<?, ?B/s]
     12%|#1        | 20.0M/170M [00:00<00:00, 210MB/s]
     27%|##7       | 45.9M/170M [00:00<00:00, 246MB/s]
     41%|####      | 69.4M/170M [00:00<00:00, 236MB/s]
     56%|#####6    | 95.4M/170M [00:00<00:00, 250MB/s]
     71%|#######1  | 121M/170M [00:00<00:00, 256MB/s] 
     87%|########6 | 147M/170M [00:00<00:00, 262MB/s]
    100%|##########| 170M/170M [00:00<00:00, 255MB/s]
+
      0%|          | 0.00/170M [00:00<?, ?B/s]
      3%|2         | 4.44M/170M [00:00<00:03, 46.2MB/s]
      5%|5         | 8.84M/170M [00:00<00:04, 35.8MB/s]
      8%|8         | 13.6M/170M [00:00<00:03, 41.3MB/s]
     10%|#         | 17.7M/170M [00:00<00:05, 31.8MB/s]
     12%|#2        | 21.1M/170M [00:00<00:05, 29.3MB/s]
     14%|#4        | 24.1M/170M [00:00<00:06, 23.3MB/s]
     16%|#6        | 27.8M/170M [00:00<00:05, 26.7MB/s]
     18%|#8        | 30.9M/170M [00:01<00:05, 28.2MB/s]
     20%|#9        | 33.8M/170M [00:01<00:05, 26.9MB/s]
     22%|##2       | 37.7M/170M [00:01<00:04, 30.5MB/s]
     25%|##4       | 42.3M/170M [00:01<00:03, 35.2MB/s]
     29%|##8       | 48.5M/170M [00:01<00:02, 42.9MB/s]
     32%|###1      | 54.3M/170M [00:01<00:02, 47.9MB/s]
     35%|###4      | 59.2M/170M [00:01<00:02, 49.0MB/s]
     38%|###7      | 64.0M/170M [00:02<00:03, 32.5MB/s]
     40%|####      | 68.0M/170M [00:02<00:03, 33.7MB/s]
     43%|####2     | 72.2M/170M [00:02<00:02, 36.1MB/s]
     45%|####5     | 76.5M/170M [00:02<00:02, 38.3MB/s]
     48%|####7     | 81.2M/170M [00:02<00:02, 41.1MB/s]
     50%|#####     | 85.4M/170M [00:02<00:02, 35.1MB/s]
     53%|#####2    | 89.4M/170M [00:02<00:02, 36.5MB/s]
     56%|#####6    | 95.4M/170M [00:02<00:01, 43.5MB/s]
     59%|#####8    | 99.9M/170M [00:03<00:02, 26.2MB/s]
     62%|######2   | 105M/170M [00:03<00:02, 32.0MB/s] 
     65%|######5   | 111M/170M [00:03<00:01, 36.7MB/s]
     69%|######9   | 117M/170M [00:03<00:01, 44.5MB/s]
     72%|#######2  | 122M/170M [00:03<00:01, 46.6MB/s]
     76%|#######5  | 129M/170M [00:03<00:00, 51.3MB/s]
     79%|#######8  | 134M/170M [00:03<00:00, 45.8MB/s]
     82%|########1 | 139M/170M [00:04<00:00, 34.9MB/s]
     84%|########4 | 143M/170M [00:04<00:00, 37.1MB/s]
     87%|########7 | 148M/170M [00:04<00:00, 41.0MB/s]
     90%|######### | 154M/170M [00:04<00:00, 44.4MB/s]
     94%|#########3| 159M/170M [00:04<00:00, 48.0MB/s]
     97%|#########6| 164M/170M [00:04<00:00, 48.9MB/s]
    100%|##########| 170M/170M [00:04<00:00, 38.4MB/s]
     /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
       for i in range(dim)
     /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -292,7 +292,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  49.764 seconds)
+   **Total running time of the script:** ( 3 minutes  8.564 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
index f7b5f235a..fa2336341 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
@@ -232,7 +232,7 @@ training. Other models require a full post training calibration.
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
-
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 158MB/s]
+
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
     32%|###2      | 4.37M/13.6M [00:00<00:00, 45.6MB/s]
     64%|######4   | 8.72M/13.6M [00:00<00:00, 41.9MB/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 59.2MB/s]
 
 
 
@@ -412,7 +412,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      90.1732      90.1488      90.6934      89.7875       0.1924   
+      90.4874      90.4642      90.9767      90.1628       0.2075   
                
 
 
@@ -461,7 +461,7 @@ TODO
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  6.625 seconds)
+   **Total running time of the script:** ( 1 minutes  10.849 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
index df062a4ee..4944e21a9 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
@@ -439,7 +439,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      116.9648     116.7765     126.4263     115.8638      1.2465   
+      120.1767     120.1294     121.7392     119.6096      0.3312   
                
 
 
@@ -476,7 +476,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  58.070 seconds)
+   **Total running time of the script:** ( 1 minutes  59.496 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
index df4c99387..be3813b38 100644
--- a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
@@ -255,7 +255,7 @@ We create a Relay VM to build and execute the model.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  32.420 seconds)
+   **Total running time of the script:** ( 1 minutes  46.615 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_quantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
index 909816063..297167415 100644
--- a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
@@ -158,7 +158,7 @@ Convert and compile model for CPU.
             data: None
       input_sym_arg_type = in_param.infer_type()[0]
     Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
-
      0%|          | 0/132723 [00:00<?, ?KB/s]
      3%|3         | 4190/132723 [00:00<00:03, 41834.71KB/s]
      9%|9         | 11988/132723 [00:00<00:01, 63079.75KB/s]
     14%|#3        | 18298/132723 [00:00<00:02, 42188.68KB/s]
     21%|##        | 27374/132723 [00:00<00:01, 56935.47KB/s]
     26%|##6       | 34595/132723 [00:00<00:01, 61555.74KB/s]
     33%|###2      | 43748/132723 [00:00<00:01, 70599.41KB/s]
     39%|###8      | 51319/132723 [00:00<00:01, 66518.64KB/s]
     45%|####5     | 60348/132723 [00:00<00:00, 73269.43KB/s]
     51%|#####1    | 68014/132723 [00:01<00:01, 64089.00KB/s]
     58%|#####8    | 77195/132723 [00:01<00:00, 71361.94KB/s]
     64%|######3   | 84741/132723 [00:01<00:00, 61955.22KB/s]
     71%|#######   | 93881/132723 [00:01<00:00, 69309.12KB/s]
     76%|#######6  | 101295/132723 [00:01<00:00, 63552.54KB/s]
     83%|########3 | 110495/132723 [00:01<00:00, 70742.83KB/s]
     89%|########8 | 117985/132723 [00:01<00:00, 67670.14KB/s]
     96%|#########5| 126937/132723 [00:01<00:00, 73416.29KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 66870.19KB/s]
+
      0%|          | 0/132723 [00:00<?, ?KB/s]
      5%|4         | 6414/132723 [00:00<00:01, 64136.24KB/s]
     11%|#1        | 14697/132723 [00:00<00:01, 75127.29KB/s]
     17%|#6        | 22210/132723 [00:00<00:01, 62992.49KB/s]
     22%|##2       | 29748/132723 [00:00<00:01, 67369.88KB/s]
     28%|##7       | 36655/132723 [00:00<00:01, 60766.91KB/s]
     34%|###3      | 44871/132723 [00:00<00:01, 67185.44KB/s]
     40%|####      | 53219/132723 [00:00<00:01, 72071.74KB/s]
     46%|####6     | 61467/132723 [00:00<00:00, 75191.29KB/s]
     53%|#####2    | 69848/132723 [00:00<00:00, 77773.84KB/s]
     59%|#####8    | 77735/132723 [00:01<00:00, 77474.75KB/s]
     65%|######4   | 86172/132723 [00:01<00:00, 79529.54KB/s]
     71%|#######1  | 94542/132723 [00:01<00:00, 80773.09KB/s]
     77%|#######7  | 102662/132723 [00:01<00:00, 58593.07KB/s]
     84%|########3 | 110973/132723 [00:01<00:00, 64386.46KB/s]
     89%|########9 | 118189/132723 [00:01<00:00, 64512.58KB/s]
     95%|#########5| 126535/132723 [00:01<00:00, 69429.16KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 70072.68KB/s]
 
 
 
@@ -241,7 +241,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  28.615 seconds)
+   **Total running time of the script:** ( 2 minutes  37.180 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
index de85b67e7..02c202fff 100644
--- a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
@@ -5,22 +5,22 @@
 
 Computation times
 =================
-**10:46.088** total execution time for **how_to_deploy_models** files:
+**11:36.686** total execution time for **how_to_deploy_models** files:
 
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 02:49.764 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 03:08.564 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 02:28.615 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 02:37.180 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 01:58.070 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 01:59.496 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:32.420 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:46.615 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:06.625 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:10.849 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:28.550 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:30.609 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:22.039 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:23.367 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)                                     | 00:00.006 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
index ad3330ee4..10d07b90b 100644
--- a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
@@ -476,7 +476,7 @@ First let us define two helper functions to get the mobilenet model and a cat im
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zipf12df131-f718-4783-9b6f-352ac0f95f20 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip01c920af-8621-4791-bb1d-d3f16afd15d3 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 
 
 
@@ -590,7 +590,7 @@ Now, to actually convert the entire network, we have written `a pass in Relay <h
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-      Check failed: (lower) is false: Intrinsic lowering function for target llvm, intrinsic name tir.sqrt, type 150 not found
+      Check failed: (lower) is false: FloatImm lowering function for target llvm type 150 not found
 
 
 
diff --git a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
index 7d4a0ca82..6ff839fd7 100644
--- a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**00:39.665** total execution time for **how_to_extend_tvm** files:
+**00:42.211** total execution time for **how_to_extend_tvm** files:
 
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:36.574 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:38.909 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.184 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.316 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:00.900 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:00.978 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)       | 00:00.008 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
index 93d0fcda8..6a6637296 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
@@ -216,10 +216,10 @@ profile the execution time of each passes.
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 8314us [8314us] (44.15%; 44.15%)
-    FoldScaleAxis: 10517us [13us] (55.85%; 55.85%)
-            FoldConstant: 10504us [2209us] (55.78%; 99.88%)
-                    InferType: 8296us [8296us] (44.05%; 78.97%)
+    InferType: 6732us [6732us] (45.68%; 45.68%)
+    FoldScaleAxis: 8004us [7us] (54.32%; 54.32%)
+            FoldConstant: 7997us [1641us] (54.27%; 99.91%)
+                    InferType: 6356us [6356us] (43.13%; 79.48%)
 
 
 
@@ -258,10 +258,10 @@ Refer to following sections and :py:func:`tvm.instrument.pass_instrument` for th
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 7433us [7433us] (49.04%; 49.04%)
-    FoldScaleAxis: 7724us [4us] (50.96%; 50.96%)
-            FoldConstant: 7720us [1618us] (50.93%; 99.94%)
-                    InferType: 6102us [6102us] (40.26%; 79.04%)
+    InferType: 6545us [6545us] (43.81%; 43.81%)
+    FoldScaleAxis: 8397us [7us] (56.19%; 56.19%)
+            FoldConstant: 8389us [1687us] (56.15%; 99.91%)
+                    InferType: 6703us [6703us] (44.86%; 79.89%)
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
index 517de753c..60b200fe9 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
@@ -340,7 +340,7 @@ latency of convolution.
 
  .. code-block:: none
 
-    Convolution: 47.030861 ms
+    Convolution: 34.195594 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
index d043dd30a..f271a33dd 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
@@ -671,7 +671,7 @@ be able to run on our build server
 
  .. code-block:: none
 
-    conv2d with tensor core: 10.383821 ms
+    conv2d with tensor core: 13.380213 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
index 041250b54..c126e5848 100644
--- a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
@@ -143,8 +143,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 
  .. code-block:: none
 
-    Numpy running time: 0.017734
-    Baseline: 3.304103
+    Numpy running time: 0.019290
+    Baseline: 3.314951
 
 
 
@@ -239,7 +239,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 
  .. code-block:: none
 
-    Opt1: 0.298228
+    Opt1: 0.319761
 
 
 
@@ -342,7 +342,7 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cach
 
  .. code-block:: none
 
-    Opt2: 0.333569
+    Opt2: 0.357257
 
 
 
@@ -438,7 +438,7 @@ the access pattern for A matrix is more cache friendly.
 
  .. code-block:: none
 
-    Opt3: 0.112827
+    Opt3: 0.121127
 
 
 
@@ -563,7 +563,7 @@ flattening.
 
  .. code-block:: none
 
-    Opt4: 0.110056
+    Opt4: 0.112509
 
 
 
@@ -685,7 +685,7 @@ write to C when all the block results are ready.
 
  .. code-block:: none
 
-    Opt5: 0.110797
+    Opt5: 0.112328
 
 
 
@@ -810,7 +810,7 @@ Furthermore, we can also utilize multi-core processors to do the thread-level par
 
  .. code-block:: none
 
-    Opt6: 0.144770
+    Opt6: 0.145362
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
index 7dde57970..76bc651ec 100644
--- a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:33.720** total execution time for **how_to_optimize_operators** files:
+**00:35.039** total execution time for **how_to_optimize_operators** files:
 
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:31.640 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:32.541 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.183 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.390 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:00.896 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.107 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
index 2d3bcb247..4a5bdcd15 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
@@ -5,18 +5,18 @@
 
 Computation times
 =================
-**05:53.101** total execution time for **how_to_tune_with_autoscheduler** files:
+**06:05.144** total execution time for **how_to_tune_with_autoscheduler** files:
 
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 03:10.941 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 03:16.712 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:20.822 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:23.975 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 00:45.275 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 00:46.934 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:19.106 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:19.548 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:08.551 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:09.047 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:08.406 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:08.929 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
index b579f6f73..a135bd055 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
@@ -240,65 +240,180 @@ cooperative fetching, unrolling and operator fusion.
                  compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
       buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
       preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
-      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 32;
-      allocate(conv2d_nchw: Pointer(local float32), float32, [2]), storage_scope = local;
+      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 16;
+      allocate(conv2d_nchw: Pointer(local float32), float32, [4]), storage_scope = local;
       allocate(pad_temp.shared: Pointer(shared float32), float32, [324]), storage_scope = shared;
-      allocate(kernel.shared: Pointer(shared float32), float32, [576]), storage_scope = shared;
+      allocate(kernel.shared: Pointer(shared float32), float32, [1152]), storage_scope = shared;
       attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392 {
         conv2d_nchw_1: Buffer(conv2d_nchw, float32, [1], [], scope="local", align=4)[0] = 0f32
         conv2d_nchw_1[1] = 0f32
+        conv2d_nchw_1[2] = 0f32
+        conv2d_nchw_1[3] = 0f32
         for (rc.outer.outer: int32, 0, 128) {
-          attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
-          if @tir.likely((threadIdx.x_1 < 324), dtype=bool) {
-            pad_temp.shared_1: Buffer(pad_temp.shared, float32, [324], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else(((((9 <= floormod(threadIdx.x_1, 81)) && (floormod(threadIdx.x_1, 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[(((((rc.outer.outer*196) + (floordiv(threadIdx.x_1, 81)*49)) + (floordiv(floormod(threadIdx.x_1, 81), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-          }
-          attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
-          kernel.shared_1: Buffer(kernel.shared, float32, [576], [], scope="shared")[threadIdx.x_2] = kernel[((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 36)*4608)) + (rc.outer.outer*36)) + floormod(threadIdx.x_2, 36))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
-          if @tir.likely((threadIdx.x_2 < 184), dtype=bool) {
-            kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 392), 36)*4608)) + (rc.outer.outer*36)) + (floordiv(floormod((threadIdx.x_2 + 32), 36), 9)*9)) + (floordiv(floormod((threadIdx.x_2 + 5), 9), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-          }
-          for (rc.outer.inner: int32, 0, 2) {
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18))]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 288)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 9)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 297)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 1)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 289)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 10)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 298)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 2)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 290)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 11)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 299)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 3)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 291)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 12)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 300)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 4)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 292)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 13)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 301)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 5)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 293)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 14)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 302)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 6)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 294)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 15)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 303)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 7)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 295)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 16)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 304)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 8)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 296)]))
-            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 17)]))
-            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 305)]))
+          let cse_var_1: int32 = (rc.outer.outer*36)
+           {
+            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+            if @tir.likely((threadIdx.x_1 < 324), dtype=bool) {
+              pad_temp.shared_1: Buffer(pad_temp.shared, float32, [324], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else(((((9 <= floormod(threadIdx.x_1, 81)) && (floormod(threadIdx.x_1, 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[(((((rc.outer.outer*196) + (floordiv(threadIdx.x_1, 81)*49)) + (floordiv(floormod(threadIdx.x_1, 81), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
+            }
+            attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+            kernel.shared_1: Buffer(kernel.shared, float32, [1152], [], scope="shared")[threadIdx.x_2] = kernel[((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 36)*4608)) + cse_var_1) + floormod(threadIdx.x_2, 36))]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+            kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 392), 36)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 32), 36), 9)*9)) + (floordiv(floormod((threadIdx.x_2 + 5), 9), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 392;
+            if @tir.likely((threadIdx.x_2 < 368), dtype=bool) {
+              kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[(((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 784), 36)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 28), 36), 9)*9)) + floormod((threadIdx.x_2 + 1), 9))]
+            }
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[(floordiv(threadIdx.x, 49)*36)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 288)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 576)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 864)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 1)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 289)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 577)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 865)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 2)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 290)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 578)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 866)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 3)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 291)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 579)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 867)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 4)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 292)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 580)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 868)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 5)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 293)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 581)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 869)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 6)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 294)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 582)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 870)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 7)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 295)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 583)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 871)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 8)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 296)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 584)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 872)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 9)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 297)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 585)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 873)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 10)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 298)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 586)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 874)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 11)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 299)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 587)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 875)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 12)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 300)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 588)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 876)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 13)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 301)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 589)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 877)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 14)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 302)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 590)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 878)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 15)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 303)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 591)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 879)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 16)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 304)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 592)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 880)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 17)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 305)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 593)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 881)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 18)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 306)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 594)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 882)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 19)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 307)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 595)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 883)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 20)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 308)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 596)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 884)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 21)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 309)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 597)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 885)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 22)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 310)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 598)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 886)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 23)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 311)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 599)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 887)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 24)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 312)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 600)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 888)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 25)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 313)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 601)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 889)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 26)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 314)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 602)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 890)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 27)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 315)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 603)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 891)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 28)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 316)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 604)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 892)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 29)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 317)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 605)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 893)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 30)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 318)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 606)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 894)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 31)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 319)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 607)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 895)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 32)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 320)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 608)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 896)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 33)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 321)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 609)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 897)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 34)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 322)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 610)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 898)]))
+            conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 35)]))
+            conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 323)]))
+            conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 611)]))
+            conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 899)]))
           }
         }
-        compute[((blockIdx.x*784) + threadIdx.x)] = max((conv2d_nchw_1[0] + bias[((blockIdx.x*16) + floordiv(threadIdx.x, 49))]), 0f32)
-        compute[(((blockIdx.x*784) + threadIdx.x) + 392)] = max((conv2d_nchw_1[1] + bias[(((blockIdx.x*16) + floordiv(threadIdx.x, 49)) + 8)]), 0f32)
+        compute[((blockIdx.x*1568) + threadIdx.x)] = max((conv2d_nchw_1[0] + bias[((blockIdx.x*32) + floordiv(threadIdx.x, 49))]), 0f32)
+        compute[(((blockIdx.x*1568) + threadIdx.x) + 392)] = max((conv2d_nchw_1[1] + bias[(((blockIdx.x*32) + floordiv(threadIdx.x, 49)) + 8)]), 0f32)
+        compute[(((blockIdx.x*1568) + threadIdx.x) + 784)] = max((conv2d_nchw_1[2] + bias[(((blockIdx.x*32) + floordiv(threadIdx.x, 49)) + 16)]), 0f32)
+        compute[(((blockIdx.x*1568) + threadIdx.x) + 1176)] = max((conv2d_nchw_1[3] + bias[(((blockIdx.x*32) + floordiv(threadIdx.x, 49)) + 24)]), 0f32)
       }
     }
 
@@ -352,7 +467,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.313 ms
+    Execution time of this operator: 0.328 ms
 
 
 
@@ -403,7 +518,7 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
     conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
     conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=8)
-    conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=2)
+    conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=4)
     conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
     conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
     conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
@@ -414,17 +529,17 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
     conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
     conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=2)
-    conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
-    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
-    conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
-    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+    conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
+    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+    conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3)
+    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
     s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2 [...]
     compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
     compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
     compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
     compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
     compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=8)
-    compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=2)
+    compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=4)
     compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
     compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
     compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
@@ -456,7 +571,7 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=392)
     s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 64)
+    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 1024)
     s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
 
     CUDA source code:
@@ -475,62 +590,173 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
       #define uint64_t unsigned long long
     #endif
     extern "C" __global__ void __launch_bounds__(392) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-      float conv2d_nchw[2];
+      float conv2d_nchw[4];
       __shared__ float pad_temp_shared[324];
-      __shared__ float kernel_shared[576];
+      __shared__ float kernel_shared[1152];
       conv2d_nchw[0] = 0.000000e+00f;
       conv2d_nchw[1] = 0.000000e+00f;
+      conv2d_nchw[2] = 0.000000e+00f;
+      conv2d_nchw[3] = 0.000000e+00f;
       for (int rc_outer_outer = 0; rc_outer_outer < 128; ++rc_outer_outer) {
         __syncthreads();
         if (((int)threadIdx.x) < 324) {
           pad_temp_shared[((int)threadIdx.x)] = (((((9 <= (((int)threadIdx.x) % 81)) && ((((int)threadIdx.x) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 196) + ((((int)threadIdx.x) / 81) * 49)) + (((((int)threadIdx.x) % 81) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
         }
-        kernel_shared[((int)threadIdx.x)] = kernel[((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 36) * 4608)) + (rc_outer_outer * 36)) + (((int)threadIdx.x) % 36))];
-        if (((int)threadIdx.x) < 184) {
-          kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 36) * 4608)) + (rc_outer_outer * 36)) + ((((((int)threadIdx.x) + 32) % 36) / 9) * 9)) + ((((((int)threadIdx.x) + 5) % 9) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+        kernel_shared[((int)threadIdx.x)] = kernel[((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 36) * 4608)) + (rc_outer_outer * 36)) + (((int)threadIdx.x) % 36))];
+        kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 392) / 36) * 4608)) + (rc_outer_outer * 36)) + ((((((int)threadIdx.x) + 32) % 36) / 9) * 9)) + ((((((int)threadIdx.x) + 5) % 9) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+        if (((int)threadIdx.x) < 368) {
+          kernel_shared[(((int)threadIdx.x) + 784)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 784) / 36) * 4608)) + (rc_outer_outer * 36)) + ((((((int)threadIdx.x) + 28) % 36) / 9) * 9)) + ((((int)threadIdx.x) + 1) % 9))];
         }
         __syncthreads();
-        for (int rc_outer_inner = 0; rc_outer_inner < 2; ++rc_outer_inner) {
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18))]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 288)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 9)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 297)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 1)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 289)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 10)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 298)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 2)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 290)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 11)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 299)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 3)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 291)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 12)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 300)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 4)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 292)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 13)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 301)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 5)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 293)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 14)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 302)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 6)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 294)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 15)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 303)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 7)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 295)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 16)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 304)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 8)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 296)]));
-          conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 17)]));
-          conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 305)]));
-        }
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[((((int)threadIdx.x) / 49) * 36)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 288)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 576)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 864)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 1)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 289)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 577)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 865)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 2)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 290)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 578)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 866)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 3)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 291)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 579)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 867)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 4)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 292)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 580)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 868)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 5)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 293)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 581)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 869)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 6)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 294)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 582)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 870)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 7)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 295)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 583)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 871)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 8)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 296)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 584)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 872)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 9)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 297)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 585)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 873)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 10)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 298)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 586)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 874)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 11)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 299)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 587)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 875)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 12)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 300)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 588)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 876)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 13)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 301)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 589)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 877)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 14)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 302)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 590)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 878)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 15)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 303)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 591)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 879)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 16)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 304)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 592)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 880)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 17)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 305)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 593)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 881)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 18)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 306)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 594)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 882)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 19)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 307)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 595)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 883)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 20)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 308)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 596)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 884)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 21)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 309)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 597)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 885)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 22)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 310)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 598)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 886)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 23)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 311)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 599)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 887)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 24)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 312)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 600)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 888)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 25)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 313)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 601)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 889)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 26)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 314)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 602)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 890)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 27)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 315)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 603)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 891)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 28)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 316)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 604)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 892)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 29)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 317)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 605)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 893)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 30)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 318)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 606)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 894)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 31)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 319)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 607)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 895)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 32)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 320)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 608)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 896)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 33)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 321)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 609)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 897)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 34)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 322)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 610)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 898)]));
+        conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 35)]));
+        conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 323)]));
+        conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 611)]));
+        conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 899)]));
       }
-      compute[((((int)blockIdx.x) * 784) + ((int)threadIdx.x))] = max((conv2d_nchw[0] + bias[((((int)blockIdx.x) * 16) + (((int)threadIdx.x) / 49))]), 0.000000e+00f);
-      compute[(((((int)blockIdx.x) * 784) + ((int)threadIdx.x)) + 392)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) * 16) + (((int)threadIdx.x) / 49)) + 8)]), 0.000000e+00f);
+      compute[((((int)blockIdx.x) * 1568) + ((int)threadIdx.x))] = max((conv2d_nchw[0] + bias[((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49))]), 0.000000e+00f);
+      compute[(((((int)blockIdx.x) * 1568) + ((int)threadIdx.x)) + 392)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49)) + 8)]), 0.000000e+00f);
+      compute[(((((int)blockIdx.x) * 1568) + ((int)threadIdx.x)) + 784)] = max((conv2d_nchw[2] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49)) + 16)]), 0.000000e+00f);
+      compute[(((((int)blockIdx.x) * 1568) + ((int)threadIdx.x)) + 1176)] = max((conv2d_nchw[3] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49)) + 24)]), 0.000000e+00f);
     }
 
 
@@ -591,7 +817,7 @@ In the example below we resume the status and do 5 more trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 3 minutes  10.941 seconds)
+   **Total running time of the script:** ( 3 minutes  16.712 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
index 304bcbd3a..3ea431eef 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
@@ -647,7 +647,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-       9.9161       9.9280       9.9583       9.8619       0.0403   
+       9.9457       9.9355      10.0108       9.8908       0.0496   
                
 
 
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
index e309b2dc6..a853aeabf 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
@@ -666,7 +666,7 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      742.2515     742.3502     742.4684     741.9358      0.2284   
+      764.6261     764.9087     765.3120     763.6577      0.7043   
                
 
 
@@ -694,7 +694,7 @@ Other Tips
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  20.822 seconds)
+   **Total running time of the script:** ( 1 minutes  23.975 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
index 7f2244748..05bd4643b 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
@@ -397,30 +397,78 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
                  placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
                  compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
       buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
-      preflattened_buffer_map = {placeholder_6: placeholder_15: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_7: placeholder_16: Buffer(placeholder_12, int32, [4916], []), placeholder_5: placeholder_17: Buffer(placeholder_10, float32, [128, 256], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_9: placeholder_18: Buffer(placeholder_14, float32, [128, 512], []), placeholder_8: placeholder_19: Buffer(placeholder_13, int32, [33], [])} {
-      for (i0.outer.i1.outer.fused: int32, 0, 64) "parallel" {
-        allocate(compute_4: Pointer(global float32), float32, [1024]), storage_scope = global {
-          for (i.outer.inner: int32, 0, 8) {
+      preflattened_buffer_map = {placeholder_6: placeholder_15: Buffer(placeholder_11, float32, [4916, 16, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), placeholder_8: placeholder_17: Buffer(placeholder_13, int32, [33], []), placeholder_7: placeholder_18: Buffer(placeholder_12, int32, [4916], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
+      for (i0.outer.i1.outer.fused: int32, 0, 16) "parallel" {
+        allocate(compute_4: Pointer(global float32), float32, [4096]), storage_scope = global {
+          for (i.outer.inner: int32, 0, 4) {
             for (nb_j.inner: int32, 0, 2) {
-              for (i.inner.init: int32, 0, 4) {
-                for (j.init: int32, 0, 16) {
-                  compute_5: Buffer(compute_4, float32, [1024], [])[((((i.outer.inner*128) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
+              for (i.inner.init: int32, 0, 32) {
+                let cse_var_1: int32 = (((i.outer.inner*1024) + (i.inner.init*32)) + (nb_j.inner*16))
+                 {
+                  compute_5: Buffer(compute_4, float32, [4096], [])[cse_var_1] = 0f32
+                  compute_5[(cse_var_1 + 1)] = 0f32
+                  compute_5[(cse_var_1 + 2)] = 0f32
+                  compute_5[(cse_var_1 + 3)] = 0f32
+                  compute_5[(cse_var_1 + 4)] = 0f32
+                  compute_5[(cse_var_1 + 5)] = 0f32
+                  compute_5[(cse_var_1 + 6)] = 0f32
+                  compute_5[(cse_var_1 + 7)] = 0f32
+                  compute_5[(cse_var_1 + 8)] = 0f32
+                  compute_5[(cse_var_1 + 9)] = 0f32
+                  compute_5[(cse_var_1 + 10)] = 0f32
+                  compute_5[(cse_var_1 + 11)] = 0f32
+                  compute_5[(cse_var_1 + 12)] = 0f32
+                  compute_5[(cse_var_1 + 13)] = 0f32
+                  compute_5[(cse_var_1 + 14)] = 0f32
+                  compute_5[(cse_var_1 + 15)] = 0f32
                 }
               }
-              for (elem_idx: int32, 0, let cse_var_1: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
-                for (i.inner: int32, 0, 4) {
-                  for (j: int32, 0, 16) {
-                    let cse_var_3: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)
-                    let cse_var_2: int32 = ((((i.outer.inner*128) + (i.inner*32)) + (nb_j.inner*16)) + j)
-                    compute_5[cse_var_2] = (compute_5[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[((((floordiv(i0.outer.i1.outer.fused, 16)*8192) + (i.outer.inner*1024)) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+              for (elem_idx: int32, 0, let cse_var_2: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
+                for (i.inner: int32, 0, 32) {
+                  let cse_var_21: int32 = (elem_idx*16)
+                  let cse_var_20: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
+                  let cse_var_19: int32 = ((i.outer.inner*8192) + (i.inner*256))
+                  let cse_var_18: int32 = (((i.outer.inner*1024) + (i.inner*32)) + (nb_j.inner*16))
+                  let cse_var_17: int32 = (cse_var_18 + 9)
+                  let cse_var_16: int32 = (cse_var_18 + 8)
+                  let cse_var_15: int32 = (cse_var_18 + 7)
+                  let cse_var_14: int32 = (cse_var_18 + 6)
+                  let cse_var_13: int32 = (cse_var_18 + 5)
+                  let cse_var_12: int32 = (cse_var_18 + 4)
+                  let cse_var_11: int32 = (cse_var_18 + 3)
+                  let cse_var_10: int32 = (cse_var_18 + 2)
+                  let cse_var_9: int32 = (cse_var_18 + 15)
+                  let cse_var_8: int32 = (cse_var_18 + 14)
+                  let cse_var_7: int32 = (cse_var_18 + 13)
+                  let cse_var_6: int32 = (cse_var_18 + 12)
+                  let cse_var_5: int32 = (cse_var_18 + 11)
+                  let cse_var_4: int32 = (cse_var_18 + 10)
+                  let cse_var_3: int32 = (cse_var_18 + 1)
+                   {
+                    compute_5[cse_var_18] = (compute_5[cse_var_18] + (placeholder_1[((placeholder_3[cse_var_20]*16) + cse_var_21)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_3] = (compute_5[cse_var_3] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 1)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 2)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 3)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 4)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 5)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 6)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 7)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 8)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 9)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 10)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 11)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 12)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 13)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 14)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                    compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 15)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
                   }
                 }
               }
             }
           }
-          for (i0.inner: int32, 0, 32) {
-            let cse_var_4: int32 = (((floordiv(i0.outer.i1.outer.fused, 16)*16384) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32))
-            compute[ramp(cse_var_4, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_4, 1, 32)]), broadcast(0f32, 32))
+          for (i0.inner: int32, 0, 128) {
+            let cse_var_22: int32 = ((i0.inner*512) + (i0.outer.i1.outer.fused*32))
+            compute[ramp(cse_var_22, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_22, 1, 32)]), broadcast(0f32, 32))
           }
         }
       }
@@ -476,7 +524,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 1.452 ms
+    Execution time of this operator: 1.744 ms
 
 
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
index d53e35860..80eb3fe0e 100644
--- a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:44.772** total execution time for **how_to_tune_with_autotvm** files:
+**00:46.154** total execution time for **how_to_tune_with_autotvm** files:
 
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:44.741 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:46.113 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.016 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.026 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)             | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
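A reading aid for the tune_conv2d_cuda.py tuning log changed in this commit: each successful trial prints a GFLOPS figure and a MeasureResult whose costs field holds the measured run time in seconds. Assuming the GFLOPS figure is the task's operation count divided by that cost (an assumption about how AutoTVM derives it; the log does not say so), every trial should imply the same operation count. A minimal check using the updated values for trials No. 9 and No. 11; the helper name implied_gflop is hypothetical:

```python
# Values copied from the updated log lines for trials No. 9 and No. 11.
# Assumption: GFLOPS = operations / cost_in_seconds / 1e9.
trials = {
    9: {"gflops": 80.74, "cost_s": 0.002867256942857143},
    11: {"gflops": 260.36, "cost_s": 0.0008891593646408842},
}


def implied_gflop(gflops: float, cost_s: float) -> float:
    """Operation count (in units of 1e9) implied by one log line."""
    return gflops * cost_s


implied = {no: implied_gflop(t["gflops"], t["cost_s"]) for no, t in trials.items()}

for no, value in implied.items():
    print(f"trial {no}: about {value:.4f} GFLOP per run")
```

If the assumption holds, both trials land on the same operation count to within rounding. Trial No. 9 keeps the same configuration in both runs, so its drop from 187.05 to 80.74 GFLOPS would then reflect a slower measurement of the same kernel rather than a different workload.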
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
index f86442fc8..584e86137 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
@@ -1156,8 +1156,8 @@ for this template
     TimeoutError
 
             [('tile_f', [-1, 2, 1, 64]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4909501
-    No: 9   GFLOPS: 187.05/187.05   result: MeasureResult(costs=(0.001237657193548387,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.0630202293395996, timestamp=1659036206.659014)        [('tile_f', [-1, 1, 4, 8]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,5072689
-    No: 10  GFLOPS: 0.00/187.05     result: Traceback (most recent call last):
+    No: 9   GFLOPS: 80.74/80.74     result: MeasureResult(costs=(0.002867256942857143,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8604607582092285, timestamp=1659038348.8080869)       [('tile_f', [-1, 1, 4, 8]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,5072689
+    No: 10  GFLOPS: 0.00/80.74      result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1280,8 +1280,8 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 4, 8]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 64, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,5092711
-    No: 11  GFLOPS: 261.45/261.45   result: MeasureResult(costs=(0.0008854458826815641,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.7113349437713623, timestamp=1659036207.5683029)      [('tile_f', [-1, 8, 2, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4264713
-    No: 12  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+    No: 11  GFLOPS: 260.36/260.36   result: MeasureResult(costs=(0.0008891593646408842,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.494821548461914, timestamp=1659038349.724512)        [('tile_f', [-1, 8, 2, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4264713
+    No: 12  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1404,7 +1404,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 128, 1, 2]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 1, 256]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,183542
-    No: 13  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+    No: 13  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1527,7 +1527,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 8, 8]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 1, 64]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2482196
-    No: 14  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+    No: 14  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1650,9 +1650,9 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 2]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10306226
-    No: 15  GFLOPS: 5.31/261.45     result: MeasureResult(costs=(0.04356790875,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8453178405761719, timestamp=1659036212.0925148)      [('tile_f', [-1, 2, 2, 8]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5330964
-    No: 16  GFLOPS: 3.35/261.45     result: MeasureResult(costs=(0.06905426975000001,), error_no=MeasureErrorNo.NO_ERROR, all_cost=4.511449813842773, timestamp=1659036213.3184597) [('tile_f', [-1, 8, 4, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2140058
-    No: 17  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+    No: 15  GFLOPS: 5.29/260.36     result: MeasureResult(costs=(0.04375233075,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8625049591064453, timestamp=1659038354.3179812)      [('tile_f', [-1, 2, 2, 8]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5330964
+    No: 16  GFLOPS: 3.34/260.36     result: MeasureResult(costs=(0.06939617725,), error_no=MeasureErrorNo.NO_ERROR, all_cost=4.593501091003418, timestamp=1659038355.5618432)       [('tile_f', [-1, 8, 4, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2140058
+    No: 17  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
         res = future.result()
       File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
@@ -1670,8 +1670,8 @@ for this template
     TimeoutError
 
             [('tile_f', [-1, 2, 2, 1]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 16]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10195251
-    No: 18  GFLOPS: 28.17/261.45    result: MeasureResult(costs=(0.008217329785714286,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.2581093311309814, timestamp=1659036224.3556004)       [('tile_f', [-1, 4, 8, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 1, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6068603
-    No: 19  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+    No: 18  GFLOPS: 27.95/260.36    result: MeasureResult(costs=(0.008283954928571428,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.2748453617095947, timestamp=1659038366.575956)        [('tile_f', [-1, 4, 8, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 1, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6068603
+    No: 19  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1794,7 +1794,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 16, 4, 8]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6956993
-    No: 20  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+    No: 20  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1973,7 +1973,7 @@ and measure running time.
     Best config:
     [('tile_f', [-1, 8, 2, 1]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4264713
     Finish loading 20 records
-    Time cost of this operator: 0.001307
+    Time cost of this operator: 0.001256
 
 
 
diff --git a/docs/_sources/how_to/work_with_microtvm/index.rst.txt b/docs/_sources/how_to/work_with_microtvm/index.rst.txt
index 84b17b999..baf90783b 100644
--- a/docs/_sources/how_to/work_with_microtvm/index.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/index.rst.txt
@@ -15,6 +15,23 @@ demonstrate how to tune and deploy models with microTVM.
     <div class="sphx-glr-thumbnails">
 
 
+.. raw:: html
+
+    <div class="sphx-glr-thumbcontainer" tooltip="This tutorial is showcasing microTVM host-driven AoT compilation with a TFLite model. AoTExecut...">
+
+.. only:: html
+
+  .. image:: /how_to/work_with_microtvm/images/thumb/sphx_glr_micro_aot_thumb.png
+    :alt: microTVM Host-Driven AoT
+
+  :ref:`sphx_glr_how_to_work_with_microtvm_micro_aot.py`
+
+.. raw:: html
+
+      <div class="sphx-glr-thumbnail-title">microTVM Host-Driven AoT</div>
+    </div>
+
+
 .. raw:: html
 
     <div class="sphx-glr-thumbcontainer" tooltip="This tutorial explains how to autotune a model using the C runtime.">
@@ -125,6 +142,7 @@ demonstrate how to tune and deploy models with microTVM.
 .. toctree::
    :hidden:
 
+   /how_to/work_with_microtvm/micro_aot
    /how_to/work_with_microtvm/micro_autotune
    /how_to/work_with_microtvm/micro_ethosu
    /how_to/work_with_microtvm/micro_reference_vm
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt
new file mode 100644
index 000000000..2e7ff7959
--- /dev/null
+++ b/docs/_sources/how_to/work_with_microtvm/micro_aot.rst.txt
@@ -0,0 +1,289 @@
+
+.. DO NOT EDIT.
+.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
+.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. "how_to/work_with_microtvm/micro_aot.py"
+.. LINE NUMBERS ARE GIVEN BELOW.
+
+.. only:: html
+
+    .. note::
+        :class: sphx-glr-download-link-note
+
+        Click :ref:`here <sphx_glr_download_how_to_work_with_microtvm_micro_aot.py>`
+        to download the full example code
+
+.. rst-class:: sphx-glr-example-title
+
+.. _sphx_glr_how_to_work_with_microtvm_micro_aot.py:
+
+
+.. _tutorial-micro-AoT:
+
+microTVM Host-Driven AoT
+===========================
+**Authors**:
+`Mehrdad Hessar <https://github.com/mehrdadh>`_,
+`Alan MacDonald <https://github.com/alanmacd>`_
+
+This tutorial showcases microTVM host-driven AoT compilation with
+a TFLite model. Compared to GraphExecutor, AoTExecutor reduces the overhead of
+parsing the graph at runtime. Ahead-of-time compilation also allows better
+memory management. This tutorial can be executed on an x86 CPU using the C runtime (CRT)
+or on the Zephyr platform on a microcontroller/board supported by Zephyr.
+
+.. GENERATED FROM PYTHON SOURCE LINES 32-44
+
+.. code-block:: default
+
+
+
+    import numpy as np
+    import pathlib
+    import json
+    import os
+
+    import tvm
+    from tvm import relay
+    from tvm.relay.backend import Executor, Runtime
+    from tvm.contrib.download import download_testdata
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 50-60
+
+Import a TFLite model
+---------------------
+
+To begin with, download and import a Keyword Spotting TFLite model.
+This model is originally from the `MLPerf Tiny repository <https://github.com/mlcommons/tiny>`_.
+To test this model, we use samples from the `KWS dataset provided by Google <https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html>`_.
+
+**Note:** By default this tutorial runs on an x86 CPU using CRT. To run on the Zephyr platform,
+export the `TVM_MICRO_USE_HW` environment variable.
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 60-82
+
+.. code-block:: default
+
+    use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW"))
+    MODEL_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite"
+    MODEL_PATH = download_testdata(MODEL_URL, "keyword_spotting_quant.tflite", module="model")
+    SAMPLE_URL = "https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy"
+    SAMPLE_PATH = download_testdata(SAMPLE_URL, "keyword_spotting_int8_6.pyc.npy", module="data")
+
+    tflite_model_buf = open(MODEL_PATH, "rb").read()
+    try:
+        import tflite
+
+        tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)
+    except AttributeError:
+        import tflite.Model
+
+        tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
+
+    input_shape = (1, 49, 10, 1)
+    INPUT_NAME = "input_1"
+    relay_mod, params = relay.frontend.from_tflite(
+        tflite_model, shape_dict={INPUT_NAME: input_shape}, dtype_dict={INPUT_NAME: "int8"}
+    )
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 83-93
+
+Defining the target
+-------------------
+
+Now we need to define the target, runtime and executor. In this tutorial, we focus on
+using the AOT host-driven executor. We use the host micro target, which is for running a model
+on an x86 CPU using the CRT runtime or with the Zephyr platform on the qemu_x86 simulator
+board. In the case of a physical microcontroller, we get the target model for the physical
+board (e.g. nucleo_l4r5zi) and pass it to `tvm.target.target.micro` to create a full
+micro target.
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 93-111
+
+.. code-block:: default
+
+
+    # Use the C runtime (crt) and enable static linking by setting system-lib to True
+    RUNTIME = Runtime("crt", {"system-lib": True})
+
+    # Simulate a microcontroller on the host machine. Uses the main() from `src/runtime/crt/host/main.cc <https://github.com/apache/tvm/blob/main/src/runtime/crt/host/main.cc>`_.
+    # To use physical hardware, replace "host" with something matching your hardware.
+    TARGET = tvm.target.target.micro("host")
+
+    # Use the AOT executor rather than graph or vm executors. Don't use unpacked API or C calling style.
+    EXECUTOR = Executor("aot")
+
+    if use_physical_hw:
+        boards_file = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr")) / "boards.json"
+        with open(boards_file) as f:
+            boards = json.load(f)
+        BOARD = os.getenv("TVM_MICRO_BOARD", default="nucleo_l4r5zi")
+        TARGET = tvm.target.target.micro(boards[BOARD]["model"])
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 112-117
+
+Compile the model
+-----------------
+
+Now, we compile the model for the target:
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 117-122
+
+.. code-block:: default
+
+    with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
+        module = tvm.relay.build(
+            relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
+        )
+
+
+
+
+
+.. rst-class:: sphx-glr-script-out
+
+ .. code-block:: none
+
+    /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
+      "target_host parameter is going to be deprecated. "
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 123-131
+
+Create a microTVM project
+-------------------------
+
+Now that we have the compiled model as an IRModule, we need to create a firmware project
+to use the compiled model with microTVM. To do this, we use the Project API. We have defined
+CRT and Zephyr microTVM template projects, which are used for x86 CPU and Zephyr boards,
+respectively.
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 131-144
+
+.. code-block:: default
+
+    template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("crt"))
+    project_options = {}  # You can use options to provide platform-specific options through TVM.
+
+    if use_physical_hw:
+        template_project_path = pathlib.Path(tvm.micro.get_microtvm_template_projects("zephyr"))
+        project_options = {"project_type": "host_driven", "zephyr_board": BOARD}
+
+    temp_dir = tvm.contrib.utils.tempdir()
+    generated_project_dir = temp_dir / "project"
+    project = tvm.micro.generate_project(
+        template_project_path, module, generated_project_dir, project_options
+    )
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 145-153
+
+Build, flash and execute the model
+----------------------------------
+Next, we build the microTVM project and flash it. The flash step is specific to
+physical microcontrollers; it is skipped when simulating a microcontroller
+via the host main.cc or when a Zephyr emulated board is selected as the target.
+Then we define the labels for the model output and execute the model on a
+sample whose expected output is index 6 (label: left).
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 153-181
+
+.. code-block:: default
+
+    project.build()
+    project.flash()
+
+    labels = [
+        "_silence_",
+        "_unknown_",
+        "yes",
+        "no",
+        "up",
+        "down",
+        "left",
+        "right",
+        "on",
+        "off",
+        "stop",
+        "go",
+    ]
+    with tvm.micro.Session(project.transport()) as session:
+        aot_executor = tvm.runtime.executor.aot_executor.AotModule(session.create_aot_executor())
+        sample = np.load(SAMPLE_PATH)
+        aot_executor.get_input(INPUT_NAME).copyfrom(sample)
+        aot_executor.run()
+        result = aot_executor.get_output(0).numpy()
+        print(f"Label is `{labels[np.argmax(result)]}` with index `{np.argmax(result)}`")
+    #
+    # Output:
+    # Label is `left` with index `6`
+    #
+
+
+
+
+.. rst-class:: sphx-glr-script-out
+
+ .. code-block:: none
+
+    Label is `left` with index `6`
+
+
+
+
+
+.. _sphx_glr_download_how_to_work_with_microtvm_micro_aot.py:
+
+.. only:: html
+
+  .. container:: sphx-glr-footer sphx-glr-footer-example
+
+
+    .. container:: sphx-glr-download sphx-glr-download-python
+
+      :download:`Download Python source code: micro_aot.py <micro_aot.py>`
+
+    .. container:: sphx-glr-download sphx-glr-download-jupyter
+
+      :download:`Download Jupyter notebook: micro_aot.ipynb <micro_aot.ipynb>`
+
+
+.. only:: html
+
+ .. rst-class:: sphx-glr-signature
+
+    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
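One detail of the micro_aot tutorial added above is easy to miss. It sets use_physical_hw = bool(os.getenv("TVM_MICRO_USE_HW")), so the check is whether the variable holds a non-empty string, not what its value is. The sketch below reproduces that expression with a plain dict standing in for the process environment; the helper name use_physical_hw_from is hypothetical and not part of the tutorial.

```python
from typing import Mapping


def use_physical_hw_from(env: Mapping[str, str]) -> bool:
    """Mirror bool(os.getenv("TVM_MICRO_USE_HW")) against a dict-like env."""
    # os.getenv returns None when the variable is unset; dict.get does the same.
    return bool(env.get("TVM_MICRO_USE_HW"))


candidates = (
    {},
    {"TVM_MICRO_USE_HW": ""},
    {"TVM_MICRO_USE_HW": "0"},
    {"TVM_MICRO_USE_HW": "1"},
)
for env in candidates:
    print(env, "->", use_physical_hw_from(env))
```

An unset or empty variable selects the default host CRT path, while any non-empty value, including "0", selects the Zephyr path. To return to the default, unset the variable rather than setting it to 0.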
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
index 0d00e279d..d72e637e9 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
@@ -329,10 +329,10 @@ Timing the untuned program
     ########## Build without Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  311.2     98.678   (1, 2, 10, 10, 3)  2       1        [311.2]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.056     0.969    (1, 6, 10, 10)     1       1        [3.056]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         1.112     0.353    (1, 1, 10, 10, 3)  1       1        [1.112]           
-    Total_time                                    -                                             315.369   -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  311.1     98.724   (1, 2, 10, 10, 3)  2       1        [311.1]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.049     0.967    (1, 6, 10, 10)     1       1        [3.049]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.972     0.308    (1, 1, 10, 10, 3)  1       1        [0.972]           
+    Total_time                                    -                                             315.12    -        -                  -       -        -                 
 
 
 
@@ -398,10 +398,10 @@ Timing the tuned program
     ########## Build with Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  249.2     98.828   (1, 1, 10, 10, 6)  2       1        [249.2]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.987     0.788    (1, 6, 10, 10)     1       1        [1.987]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.969     0.384    (1, 1, 10, 10, 3)  1       1        [0.969]           
-    Total_time                                    -                                             252.156   -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  149.8     98.167   (1, 6, 10, 10, 1)  2       1        [149.8]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.815     1.189    (1, 6, 10, 10)     1       1        [1.815]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.982     0.644    (1, 1, 10, 10, 3)  1       1        [0.982]           
+    Total_time                                    -                                             152.597   -        -                  -       -        -                 
 
 
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
index 4488a03c1..c6757b017 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
@@ -225,7 +225,7 @@ take about **2 minutes** to download the Stanford Cars, while COCO 2017 validati
  .. code-block:: none
 
 
-    '/tmp/tmp56hx7b7g/images/random'
+    '/tmp/tmpkmqjaw7b/images/random'
 
 
 
@@ -325,8 +325,8 @@ objects to other stuff? We can display some examples from our datasets using ``m
 
  .. code-block:: none
 
-    /tmp/tmp56hx7b7g/images/target contains 8144 images
-    /tmp/tmp56hx7b7g/images/random contains 5000 images
+    /tmp/tmpkmqjaw7b/images/target contains 8144 images
+    /tmp/tmpkmqjaw7b/images/random contains 5000 images
 
 
 
@@ -501,13 +501,13 @@ the time on our validation set).
  .. code-block:: none
 
     Epoch 1/3
-    328/328 - 55s - loss: 0.2052 - accuracy: 0.9269 - val_loss: 0.1312 - val_accuracy: 0.9581
+    328/328 - 56s - loss: 0.2187 - accuracy: 0.9223 - val_loss: 0.1407 - val_accuracy: 0.9581
     Epoch 2/3
-    328/328 - 52s - loss: 0.0962 - accuracy: 0.9640 - val_loss: 0.1109 - val_accuracy: 0.9637
+    328/328 - 53s - loss: 0.0970 - accuracy: 0.9618 - val_loss: 0.1128 - val_accuracy: 0.9637
     Epoch 3/3
-    328/328 - 52s - loss: 0.0615 - accuracy: 0.9768 - val_loss: 0.1065 - val_accuracy: 0.9645
+    328/328 - 52s - loss: 0.0658 - accuracy: 0.9754 - val_loss: 0.0967 - val_accuracy: 0.9679
 
-    <keras.callbacks.History object at 0x7f76f644c450>
+    <keras.callbacks.History object at 0x7f86f8594d90>
 
 
 
@@ -864,7 +864,7 @@ Arduino tutorial for how to do that `on GitHub <https://github.com/guberti/tvm-a
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 5 minutes  8.568 seconds)
+   **Total running time of the script:** ( 5 minutes  10.753 seconds)
 
 
 .. _sphx_glr_download_how_to_work_with_microtvm_micro_train.py:
diff --git a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
index 2a1bc2c3c..bcc0e4283 100644
--- a/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/sg_execution_times.rst.txt
@@ -5,14 +5,16 @@
 
 Computation times
 =================
-**05:53.166** total execution time for **how_to_work_with_microtvm** files:
+**06:05.900** total execution time for **how_to_work_with_microtvm** files:
 
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)               | 05:08.568 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_train.py` (``micro_train.py``)               | 05:10.753 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)         | 00:41.466 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_autotune.py` (``micro_autotune.py``)         | 00:43.431 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)             | 00:03.131 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_aot.py` (``micro_aot.py``)                   | 00:08.269 | 0.0 MB |
++---------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_how_to_work_with_microtvm_micro_tflite.py` (``micro_tflite.py``)             | 00:03.446 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_microtvm_micro_ethosu.py` (``micro_ethosu.py``)             | 00:00.001 | 0.0 MB |
 +---------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
index f898f034b..4fd1dd139 100644
--- a/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_relay/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**00:37.909** total execution time for **how_to_work_with_relay** files:
+**00:38.828** total execution time for **how_to_work_with_relay** files:
 
 +----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_relay_using_pipeline_executor.py` (``using_pipeline_executor.py``) | 00:29.025 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_relay_using_pipeline_executor.py` (``using_pipeline_executor.py``) | 00:31.358 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``)           | 00:07.643 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_relay_using_external_lib.py` (``using_external_lib.py``)           | 00:05.886 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)                             | 00:01.234 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_relay_build_gcn.py` (``build_gcn.py``)                             | 00:01.577 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_work_with_relay_using_relay_viz.py` (``using_relay_viz.py``)                 | 00:00.007 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt b/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt
index aca61e4b9..8039f0a20 100644
--- a/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/intrin_math.rst.txt
@@ -261,7 +261,7 @@ The following example customizes CUDA lowering rule for :code:`exp`.
  .. code-block:: none
 
 
-    <function my_cuda_math_rule at 0x7f76f63c03b0>
+    <function my_cuda_math_rule at 0x7f8679293f80>
 
 
 
diff --git a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
index 09bf91656..57679c91e 100644
--- a/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/sg_execution_times.rst.txt
@@ -5,22 +5,22 @@
 
 Computation times
 =================
-**00:03.756** total execution time for **how_to_work_with_schedules** files:
+**00:04.455** total execution time for **how_to_work_with_schedules** files:
 
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)                 | 00:01.711 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_intrin_math.py` (``intrin_math.py``)                 | 00:01.998 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)                     | 00:00.916 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_tensorize.py` (``tensorize.py``)                     | 00:01.175 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)                     | 00:00.484 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_reduction.py` (``reduction.py``)                     | 00:00.557 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)                               | 00:00.468 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_scan.py` (``scan.py``)                               | 00:00.541 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)                     | 00:00.097 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_extern_op.py` (``extern_op.py``)                     | 00:00.102 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``) | 00:00.040 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_schedule_primitives.py` (``schedule_primitives.py``) | 00:00.041 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)                               | 00:00.026 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_tedd.py` (``tedd.py``)                               | 00:00.027 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)               | 00:00.014 | 0.0 MB |
+| :ref:`sphx_glr_how_to_work_with_schedules_tuple_inputs.py` (``tuple_inputs.py``)               | 00:00.015 | 0.0 MB |
 +------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
index 15c6edc81..574466268 100644
--- a/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
+++ b/docs/_sources/how_to/work_with_schedules/tensorize.rst.txt
@@ -347,7 +347,7 @@ The importing needs to happen before the tensorized GEMV being executed.
                  C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C}
       preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
-      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmp2_2mr7z0/input0.cc'\nsource_filename = \"/tmp/tmp2_2mr7z0/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = alloca float*, align 8\n  %8 = alloca float*, align 8\n  %9 = alloca floa [...]
+      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmphxqkmtky/input0.cc'\nsource_filename = \"/tmp/tmphxqkmtky/input0.cc\"\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = alloca float*, align 8\n  %8 = alloca float*, align 8\n  %9 = alloca floa [...]
       for (i, 0, 1024) {
         for (j.outer: int32, 0, 32) {
           @tir.call_extern("gemv_update", @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
diff --git a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
index 6bfeaa0a1..5029871e0 100644
--- a/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:20.586** total execution time for **topic_vta_tutorials_autotvm** files:
+**00:21.805** total execution time for **topic_vta_tutorials_autotvm** files:
 
 +---------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``) | 00:20.579 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``) | 00:21.799 | 0.0 MB |
 +---------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_topic_vta_tutorials_autotvm_tune_alu_vta.py` (``tune_alu_vta.py``)     | 00:00.006 | 0.0 MB |
 +---------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
index 15bfaf44a..140ed471c 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_classification.rst.txt
@@ -291,7 +291,7 @@ The compilation steps are:
       DeprecationWarning,
     /workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the  new recommended usage.
       relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
-    resnet18_v1 inference graph built in 22.09s!
+    resnet18_v1 inference graph built in 23.75s!
 
 
 
diff --git a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
index 86d127cc2..bf831f9aa 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/deploy_detection.rst.txt
@@ -335,7 +335,7 @@ The compilation steps are:
       "target_host parameter is going to be deprecated. "
     /workspace/python/tvm/relay/build_module.py:411: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
       DeprecationWarning,
-    yolov3-tiny inference graph built in 15.66s!
+    yolov3-tiny inference graph built in 16.48s!
 
 
 
diff --git a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
index ab55fb762..b77023e07 100644
--- a/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/frontend/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**01:30.205** total execution time for **topic_vta_tutorials_frontend** files:
+**01:33.558** total execution time for **topic_vta_tutorials_frontend** files:
 
 +------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)           | 00:48.071 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_detection.py` (``deploy_detection.py``)           | 00:49.428 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``) | 00:42.134 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``) | 00:44.131 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
index 7120a36a2..2989d3bdf 100644
--- a/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/optimize/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:03.132** total execution time for **topic_vta_tutorials_optimize** files:
+**00:03.293** total execution time for **topic_vta_tutorials_optimize** files:
 
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)         | 00:02.781 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)         | 00:02.864 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``) | 00:00.351 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``) | 00:00.429 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
index 1aec0594c..65ce81639 100644
--- a/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
+++ b/docs/_sources/topic/vta/tutorials/sg_execution_times.rst.txt
@@ -5,10 +5,10 @@
 
 Computation times
 =================
-**00:00.622** total execution time for **topic_vta_tutorials** files:
+**00:00.781** total execution time for **topic_vta_tutorials** files:
 
 +---------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``) | 00:00.332 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``) | 00:00.420 | 0.0 MB |
 +---------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``) | 00:00.289 | 0.0 MB |
+| :ref:`sphx_glr_topic_vta_tutorials_vta_get_started.py` (``vta_get_started.py``) | 00:00.361 | 0.0 MB |
 +---------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
index 735124c29..50166b7ee 100644
--- a/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
+++ b/docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt
@@ -328,7 +328,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 93.273 ms
+    Execution time of this operator: 95.801 ms
 
 
 
@@ -428,7 +428,7 @@ resume the status and do more 5 trials.
     Resume search:
     /usr/local/lib/python3.7/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated.  See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
       warnings.warn(f'Old style callback is deprecated.  See: {link}', UserWarning)
-    *E*E
+
 
 
 
@@ -446,7 +446,7 @@ operations.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  19.752 seconds)
+   **Total running time of the script:** ( 1 minutes  2.489 seconds)
 
 
 .. _sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py:
diff --git a/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt b/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt
index 53ffd7a41..de7214fed 100644
--- a/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt
+++ b/docs/_sources/tutorial/autotvm_matmul_x86.rst.txt
@@ -462,16 +462,16 @@ reduce variance, we take 5 measurements and average them.
     waiting for device...
     device available
     Get devices for measurement successfully!
-    No: 1   GFLOPS: 9.94/9.94       result: MeasureResult(costs=(0.0269965032,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5648865699768066, timestamp=1659035020.8669958)       [('tile_y', [-1, 1]), ('tile_x', [-1, 256])],None,80
-    No: 2   GFLOPS: 2.82/9.94       result: MeasureResult(costs=(0.0952838752,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6697702407836914, timestamp=1659035022.554008)        [('tile_y', [-1, 4]), ('tile_x', [-1, 8])],None,32
-    No: 3   GFLOPS: 11.85/11.85     result: MeasureResult(costs=(0.022660449,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5570068359375, timestamp=1659035023.6011431)   [('tile_y', [-1, 64]), ('tile_x', [-1, 32])],None,56
-    No: 4   GFLOPS: 1.65/11.85      result: MeasureResult(costs=(0.16307870060000001,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.727090358734131, timestamp=1659035026.889413)  [('tile_y', [-1, 1]), ('tile_x', [-1, 4])],None,20
-    No: 5   GFLOPS: 3.66/11.85      result: MeasureResult(costs=(0.0732470756,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.3117024898529053, timestamp=1659035028.3205142)       [('tile_y', [-1, 256]), ('tile_x', [-1, 16])],None,48
-    No: 6   GFLOPS: 1.76/11.85      result: MeasureResult(costs=(0.152845861,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.5684256553649902, timestamp=1659035031.4550264)        [('tile_y', [-1, 512]), ('tile_x', [-1, 4])],None,29
-    No: 7   GFLOPS: 0.86/11.85      result: MeasureResult(costs=(0.3130899754,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.129064559936523, timestamp=1659035036.6268103)        [('tile_y', [-1, 512]), ('tile_x', [-1, 2])],None,19
-    No: 8   GFLOPS: 10.57/11.85     result: MeasureResult(costs=(0.025399391799999997,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5462801456451416, timestamp=1659035037.19481) [('tile_y', [-1, 4]), ('tile_x', [-1, 64])],None,62
-    No: 9   GFLOPS: 1.62/11.85      result: MeasureResult(costs=(0.16565159599999998,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.748063087463379, timestamp=1659035040.0614984) [('tile_y', [-1, 2]), ('tile_x', [-1, 2])],None,11
-    No: 10  GFLOPS: 2.35/11.85      result: MeasureResult(costs=(0.114151147,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.931006669998169, timestamp=1659035042.0504346) [('tile_y', [-1, 4]), ('tile_x', [-1, 4])],None,22
+    No: 1   GFLOPS: 10.24/10.24     result: MeasureResult(costs=(0.0262166112,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5573618412017822, timestamp=1659037111.9611266)       [('tile_y', [-1, 1]), ('tile_x', [-1, 256])],None,80
+    No: 2   GFLOPS: 2.96/10.24      result: MeasureResult(costs=(0.0906545668,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6005101203918457, timestamp=1659037114.1158283)       [('tile_y', [-1, 4]), ('tile_x', [-1, 8])],None,32
+    No: 3   GFLOPS: 11.81/11.81     result: MeasureResult(costs=(0.022734837200000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5972039699554443, timestamp=1659037114.6895661)       [('tile_y', [-1, 64]), ('tile_x', [-1, 32])],None,56
+    No: 4   GFLOPS: 1.86/11.81      result: MeasureResult(costs=(0.1439960192,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.4247968196868896, timestamp=1659037117.700718)        [('tile_y', [-1, 1]), ('tile_x', [-1, 4])],None,20
+    No: 5   GFLOPS: 3.65/11.81      result: MeasureResult(costs=(0.0736059658,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.31510591506958, timestamp=1659037119.148282)  [('tile_y', [-1, 256]), ('tile_x', [-1, 16])],None,48
+    No: 6   GFLOPS: 1.82/11.81      result: MeasureResult(costs=(0.1476827578,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.489528179168701, timestamp=1659037122.2242532)        [('tile_y', [-1, 512]), ('tile_x', [-1, 4])],None,29
+    No: 7   GFLOPS: 0.87/11.81      result: MeasureResult(costs=(0.30776385300000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.049600601196289, timestamp=1659037127.315)     [('tile_y', [-1, 512]), ('tile_x', [-1, 2])],None,19
+    No: 8   GFLOPS: 10.74/11.81     result: MeasureResult(costs=(0.0250025322,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5460166931152344, timestamp=1659037127.8792002)       [('tile_y', [-1, 4]), ('tile_x', [-1, 64])],None,62
+    No: 9   GFLOPS: 1.91/11.81      result: MeasureResult(costs=(0.1407600762,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.353973627090454, timestamp=1659037130.3529673)        [('tile_y', [-1, 2]), ('tile_x', [-1, 2])],None,11
+    No: 10  GFLOPS: 2.80/11.81      result: MeasureResult(costs=(0.095986646,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6423468589782715, timestamp=1659037132.0528631)        [('tile_y', [-1, 4]), ('tile_x', [-1, 4])],None,22
 
 
 
diff --git a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
index f370455ea..c7ad48fc0 100644
--- a/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
+++ b/docs/_sources/tutorial/autotvm_relay_x86.rst.txt
@@ -327,7 +327,7 @@ standard deviation.
 
  .. code-block:: none
 
-    {'mean': 490.9751048400267, 'median': 490.97516415004065, 'std': 0.751160590199003}
+    {'mean': 499.76453579998633, 'median': 499.21146484994097, 'std': 1.7238883331097248}
 
 
 
@@ -563,30 +563,30 @@ the tuning data to.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-
    [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  1/25]  Current/Best:   17.28/  17.28 GFLOPS | Progress: (4/20) | 5.76 s
    [Task  1/25]  Current/Best:    6.15/  17.28 GFLOPS | Progress: (8/20) | 9.23 s
    [Task  1/25]  Current/Best:   11.57/  22.70 GFLOPS | Progress: (12/20) | 11.71 s
    [Task  1/25]  Current/Best:   16.76/  22.84 GFLOPS | Progress: (16/20) | 13.39 s
    [Task  1/25]  Current/Best:   11.64/  23.88 GFLOPS | Progress: (20/20) | 15.13 s Done.
-
    [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  2/25]  Current/Best:   12.25/  13.33 GFLOPS | Progress: (4/20) | 3.84 s
    [Task  2/25]  Current/Best:   14.17/  18.67 GFLOPS | Progress: (8/20) | 5.16 s
    [Task  2/25]  Current/Best:   21.17/  21.17 GFLOPS | Progress: (12/20) | 6.47 s
    [Task  2/25]  Current/Best:   12.27/  21.17 GFLOPS | Progress: (16/20) | 7.73 s
    [Task  2/25]  Current/Best:   19.42/  21.17 GFLOPS | Progress: (20/20) | 9.34 s Done.
-
    [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  3/25]  Current/Best:    1.63/  10.54 GFLOPS | Progress: (4/20) | 5.86 s
    [Task  3/25]  Current/Best:   15.60/  16.92 GFLOPS | Progress: (8/20) | 7.79 s
    [Task  3/25]  Current/Best:   14.90/  16.92 GFLOPS | Progress: (12/20) | 9.51 s
    [Task  3/25]  Current/Best:    7.20/  23.84 GFLOPS | Progress: (16/20) | 11.42 s
    [Task  3/25]  Current/Best:   12.69/  23.84 GFLOPS | Progress: (20/20) | 15.93 s Done.
-
    [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  4/25]  Current/Best:    9.46/  20.33 GFLOPS | Progress: (4/20) | 2.37 s
    [Task  4/25]  Current/Best:    6.86/  20.33 GFLOPS | Progress: (8/20) | 6.97 s
    [Task  4/25]  Current/Best:   22.37/  22.37 GFLOPS | Progress: (12/20) | 11.84 s
    [Task  4/25]  Current/Best:   17.39/  22.37 GFLOPS | Progress: (16/20) | 14.19 s
    [Task  4/25]  Current/Best:   13.41/  22.37 GFLOPS | Progress: (20/20) | 16.22 s Done.
-
    [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  5/25]  Current/Best:    9.74/  10.45 GFLOPS | Progress: (4/20) | 2.58 s
    [Task  5/25]  Current/Best:   11.83/  12.80 GFLOPS | Progress: (8/20) | 4.62 s
    [Task  5/25]  Current/Best:   11.86/  18.15 GFLOPS | Progress: (12/20) | 7.79 s
    [Task  5/25]  Current/Best:   11.87/  22.73 GFLOPS | Progress: (16/20) | 9.23 s
    [Task  5/25]  Current/Best:   12.11/  22.73 GFLOPS | Progress: (20/20) | 11.16 s Done.
-
    [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  6/25]  Current/Best:   12.16/  20.69 GFLOPS | Progress: (4/20) | 4.07 s
    [Task  6/25]  Current/Best:   19.08/  20.69 GFLOPS | Progress: (8/20) | 5.83 s
    [Task  6/25]  Current/Best:   13.36/  20.69 GFLOPS | Progress: (12/20) | 7.75 s
    [Task  6/25]  Current/Best:   19.94/  20.69 GFLOPS | Progress: (16/20) | 9.96 s
    [Task  6/25]  Current/Best:    3.73/  20.69 GFLOPS | Progress: (20/20) | 12.48 s Done.
-
    [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  7/25]  Current/Best:   11.29/  12.21 GFLOPS | Progress: (4/20) | 3.60 s
    [Task  7/25]  Current/Best:   20.33/  21.16 GFLOPS | Progress: (8/20) | 5.10 s
    [Task  7/25]  Current/Best:   16.11/  21.16 GFLOPS | Progress: (12/20) | 7.01 s
    [Task  7/25]  Current/Best:   12.30/  21.16 GFLOPS | Progress: (16/20) | 9.03 s
    [Task  7/25]  Current/Best:    6.44/  21.82 GFLOPS | Progress: (20/20) | 11.47 s Done.
-
    [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  8/25]  Current/Best:    9.80/  13.91 GFLOPS | Progress: (4/20) | 2.93 s
    [Task  8/25]  Current/Best:    9.47/  13.91 GFLOPS | Progress: (8/20) | 8.01 s
    [Task  8/25]  Current/Best:   12.43/  13.91 GFLOPS | Progress: (12/20) | 14.33 s
    [Task  8/25]  Current/Best:   18.82/  18.82 GFLOPS | Progress: (16/20) | 16.40 s
    [Task  8/25]  Current/Best:   20.07/  20.07 GFLOPS | Progress: (20/20) | 23.33 s Done.
-
    [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  9/25]  Current/Best:   14.37/  15.76 GFLOPS | Progress: (4/20) | 11.93 s
    [Task  9/25]  Current/Best:   23.38/  23.38 GFLOPS | Progress: (8/20) | 13.65 s
    [Task  9/25]  Current/Best:    8.27/  23.38 GFLOPS | Progress: (12/20) | 16.16 s
    [Task  9/25]  Current/Best:   17.96/  23.38 GFLOPS | Progress: (16/20) | 18.99 s
    [Task  9/25]  Current/Best:    9.27/  23.38 GFLOPS | Progress: (20/20) | 27.44 s
    [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 10/25]  Current/Best:   18.34/  18.34 GFLOPS | Progress: (4/20) | 2.56 s
    [Task 10/25]  Current/Best:   15.44/  18.34 GFLOPS | Progress: (8/20) | 4.17 s
    [Task 10/25]  Current/Best:   12.75/  18.95 GFLOPS | Progress: (12/20) | 5.69 s
    [Task 10/25]  Current/Best:   19.11/  20.25 GFLOPS | Progress: (16/20) | 6.79 s
    [Task 10/25]  Current/Best:    8.83/  20.25 GFLOPS | Progress: (20/20) | 8.30 s Done.
-
    [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 11/25]  Current/Best:   12.32/  18.10 GFLOPS | Progress: (4/20) | 3.28 s
    [Task 11/25]  Current/Best:   16.89/  18.10 GFLOPS | Progress: (8/20) | 6.05 s
    [Task 11/25]  Current/Best:   18.29/  18.29 GFLOPS | Progress: (12/20) | 8.07 s
    [Task 11/25]  Current/Best:   13.49/  21.24 GFLOPS | Progress: (16/20) | 10.99 s
    [Task 11/25]  Current/Best:   19.44/  21.58 GFLOPS | Progress: (20/20) | 13.06 s Done.
-
    [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 12/25]  Current/Best:    7.84/  17.90 GFLOPS | Progress: (4/20) | 5.56 s
    [Task 12/25]  Current/Best:    5.27/  17.90 GFLOPS | Progress: (8/20) | 9.44 s
    [Task 12/25]  Current/Best:   18.91/  18.91 GFLOPS | Progress: (12/20) | 11.43 s
    [Task 12/25]  Current/Best:   15.52/  18.91 GFLOPS | Progress: (16/20) | 14.29 s
    [Task 12/25]  Current/Best:   15.11/  18.91 GFLOPS | Progress: (20/20) | 16.19 s Done.
-
    [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 13/25]  Current/Best:    8.73/  17.34 GFLOPS | Progress: (4/20) | 3.69 s
    [Task 13/25]  Current/Best:   15.89/  21.22 GFLOPS | Progress: (8/20) | 6.28 s
    [Task 13/25]  Current/Best:   19.69/  21.69 GFLOPS | Progress: (12/20) | 9.34 s
    [Task 13/25]  Current/Best:   12.31/  21.69 GFLOPS | Progress: (16/20) | 12.70 s
    [Task 13/25]  Current/Best:   18.64/  21.69 GFLOPS | Progress: (20/20) | 15.00 s Done.
-
    [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 14/25]  Current/Best:   13.63/  13.63 GFLOPS | Progress: (4/20) | 3.31 s
    [Task 14/25]  Current/Best:    6.11/  13.63 GFLOPS | Progress: (8/20) | 5.50 s
    [Task 14/25]  Current/Best:   20.66/  20.66 GFLOPS | Progress: (12/20) | 8.18 s
    [Task 14/25]  Current/Best:   17.04/  20.66 GFLOPS | Progress: (16/20) | 9.81 s Done.
-
    [Task 14/25]  Current/Best:   17.07/  20.66 GFLOPS | Progress: (20/20) | 11.55 s
    [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 15/25]  Current/Best:   16.18/  17.63 GFLOPS | Progress: (4/20) | 2.71 s
    [Task 15/25]  Current/Best:   13.28/  18.07 GFLOPS | Progress: (8/20) | 4.05 s
    [Task 15/25]  Current/Best:   10.26/  22.37 GFLOPS | Progress: (12/20) | 6.26 s
    [Task 15/25]  Current/Best:   20.45/  22.37 GFLOPS | Progress: (16/20) | 9.50 s
    [Task 15/25]  Current/Best:    9.66/  22.37 GFLOPS | Progress: (20/20) | 10.51 s
    [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 16/25]  Current/Best:   19.13/  19.13 GFLOPS | Progress: (4/20) | 2.90 s
    [Task 16/25]  Current/Best:    3.04/  19.13 GFLOPS | Progress: (8/20) | 4.50 s
    [Task 16/25]  Current/Best:   19.53/  19.53 GFLOPS | Progress: (12/20) | 5.70 s
    [Task 16/25]  Current/Best:   17.96/  19.53 GFLOPS | Progress: (16/20) | 7.04 s
    [Task 16/25]  Current/Best:   10.04/  19.53 GFLOPS | Progress: (20/20) | 9.18 s Done.
-
    [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 17/25]  Current/Best:   13.74/  18.80 GFLOPS | Progress: (4/20) | 4.75 s
    [Task 17/25]  Current/Best:   12.69/  23.45 GFLOPS | Progress: (8/20) | 7.52 s
    [Task 17/25]  Current/Best:   16.88/  23.45 GFLOPS | Progress: (12/20) | 9.56 s
    [Task 17/25]  Current/Best:   16.53/  23.45 GFLOPS | Progress: (16/20) | 11.74 s
    [Task 17/25]  Current/Best:   10.06/  23.45 GFLOPS | Progress: (20/20) | 13.88 s Done.
-
    [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 18/25]  Current/Best:   11.28/  17.11 GFLOPS | Progress: (4/20) | 3.77 s
    [Task 18/25]  Current/Best:   10.58/  17.11 GFLOPS | Progress: (8/20) | 7.43 s
    [Task 18/25]  Current/Best:   19.41/  19.41 GFLOPS | Progress: (12/20) | 9.38 s
    [Task 18/25]  Current/Best:   10.12/  19.41 GFLOPS | Progress: (16/20) | 13.21 s
    [Task 18/25]  Current/Best:   20.82/  20.82 GFLOPS | Progress: (20/20) | 14.73 s Done.
-
    [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 19/25]  Current/Best:    7.27/  20.53 GFLOPS | Progress: (4/20) | 6.01 s
    [Task 19/25]  Current/Best:    2.61/  20.53 GFLOPS | Progress: (8/20) | 9.41 s
    [Task 19/25]  Current/Best:   20.27/  22.00 GFLOPS | Progress: (12/20) | 12.36 s
    [Task 19/25]  Current/Best:   13.91/  22.00 GFLOPS | Progress: (16/20) | 15.35 s
    [Task 19/25]  Current/Best:    2.70/  23.88 GFLOPS | Progress: (20/20) | 18.20 s Done.
-
    [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 20/25]  Current/Best:    9.35/  15.39 GFLOPS | Progress: (4/20) | 3.29 s Done.
+
    [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  1/25]  Current/Best:   17.35/  17.35 GFLOPS | Progress: (4/20) | 6.35 s
    [Task  1/25]  Current/Best:    6.15/  17.35 GFLOPS | Progress: (8/20) | 9.39 s
    [Task  1/25]  Current/Best:   11.53/  22.65 GFLOPS | Progress: (12/20) | 11.86 s
    [Task  1/25]  Current/Best:   16.68/  22.65 GFLOPS | Progress: (16/20) | 13.55 s
    [Task  1/25]  Current/Best:   11.57/  23.88 GFLOPS | Progress: (20/20) | 15.29 s Done.
+
    [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  2/25]  Current/Best:   12.15/  13.07 GFLOPS | Progress: (4/20) | 3.94 s
    [Task  2/25]  Current/Best:   13.98/  18.52 GFLOPS | Progress: (8/20) | 5.26 s
    [Task  2/25]  Current/Best:   21.00/  21.00 GFLOPS | Progress: (12/20) | 6.59 s
    [Task  2/25]  Current/Best:   12.16/  21.00 GFLOPS | Progress: (16/20) | 7.89 s
    [Task  2/25]  Current/Best:   19.92/  21.00 GFLOPS | Progress: (20/20) | 9.51 s Done.
+
    [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  3/25]  Current/Best:    1.63/  10.53 GFLOPS | Progress: (4/20) | 5.90 s
    [Task  3/25]  Current/Best:   15.27/  16.84 GFLOPS | Progress: (8/20) | 7.83 s
    [Task  3/25]  Current/Best:   14.87/  16.84 GFLOPS | Progress: (12/20) | 9.56 s
    [Task  3/25]  Current/Best:    7.22/  23.69 GFLOPS | Progress: (16/20) | 11.48 s
    [Task  3/25]  Current/Best:   12.55/  23.69 GFLOPS | Progress: (20/20) | 16.05 s Done.
+
    [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  4/25]  Current/Best:    9.57/  20.46 GFLOPS | Progress: (4/20) | 2.42 s
    [Task  4/25]  Current/Best:    6.78/  20.46 GFLOPS | Progress: (8/20) | 7.17 s
    [Task  4/25]  Current/Best:   21.64/  21.64 GFLOPS | Progress: (12/20) | 12.06 s
    [Task  4/25]  Current/Best:   17.09/  21.64 GFLOPS | Progress: (16/20) | 14.48 s
    [Task  4/25]  Current/Best:   13.20/  21.64 GFLOPS | Progress: (20/20) | 16.47 s Done.
+
    [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  5/25]  Current/Best:    9.75/  10.44 GFLOPS | Progress: (4/20) | 2.62 s
    [Task  5/25]  Current/Best:   11.75/  13.05 GFLOPS | Progress: (8/20) | 4.67 s
    [Task  5/25]  Current/Best:    9.61/  17.86 GFLOPS | Progress: (12/20) | 7.86 s
    [Task  5/25]  Current/Best:   11.85/  22.67 GFLOPS | Progress: (16/20) | 9.29 s
    [Task  5/25]  Current/Best:   11.94/  22.67 GFLOPS | Progress: (20/20) | 11.21 s Done.
+
    [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  6/25]  Current/Best:   12.15/  20.67 GFLOPS | Progress: (4/20) | 4.18 s
    [Task  6/25]  Current/Best:   18.81/  20.67 GFLOPS | Progress: (8/20) | 5.92 s
    [Task  6/25]  Current/Best:   13.26/  20.67 GFLOPS | Progress: (12/20) | 7.86 s
    [Task  6/25]  Current/Best:   19.87/  20.67 GFLOPS | Progress: (16/20) | 10.10 s
    [Task  6/25]  Current/Best:    3.75/  20.67 GFLOPS | Progress: (20/20) | 12.61 s Done.
+
    [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  7/25]  Current/Best:   10.57/  12.30 GFLOPS | Progress: (4/20) | 3.62 s
    [Task  7/25]  Current/Best:   20.10/  21.20 GFLOPS | Progress: (8/20) | 5.16 s
    [Task  7/25]  Current/Best:   16.06/  21.20 GFLOPS | Progress: (12/20) | 7.12 s
    [Task  7/25]  Current/Best:   12.24/  21.20 GFLOPS | Progress: (16/20) | 9.18 s
    [Task  7/25]  Current/Best:    6.27/  21.69 GFLOPS | Progress: (20/20) | 11.68 s Done.
+
    [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  8/25]  Current/Best:   10.44/  14.57 GFLOPS | Progress: (4/20) | 2.95 s
    [Task  8/25]  Current/Best:    9.91/  14.57 GFLOPS | Progress: (8/20) | 8.03 s
    [Task  8/25]  Current/Best:   12.86/  14.57 GFLOPS | Progress: (12/20) | 14.61 s
    [Task  8/25]  Current/Best:   18.99/  18.99 GFLOPS | Progress: (16/20) | 16.69 s
    [Task  8/25]  Current/Best:   20.19/  20.19 GFLOPS | Progress: (20/20) | 23.74 s Done.
+
    [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task  9/25]  Current/Best:   14.28/  14.63 GFLOPS | Progress: (4/20) | 12.01 s
    [Task  9/25]  Current/Best:   23.28/  23.28 GFLOPS | Progress: (8/20) | 13.82 s
    [Task  9/25]  Current/Best:    8.17/  23.28 GFLOPS | Progress: (12/20) | 16.33 s
    [Task  9/25]  Current/Best:   17.84/  23.28 GFLOPS | Progress: (16/20) | 19.20 s
    [Task  9/25]  Current/Best:    9.07/  23.28 GFLOPS | Progress: (20/20) | 27.81 s
    [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 10/25]  Current/Best:   18.25/  18.25 GFLOPS | Progress: (4/20) | 2.61 s
    [Task 10/25]  Current/Best:   15.56/  18.25 GFLOPS | Progress: (8/20) | 4.24 s
    [Task 10/25]  Current/Best:   12.67/  19.01 GFLOPS | Progress: (12/20) | 5.78 s
    [Task 10/25]  Current/Best:   19.04/  20.36 GFLOPS | Progress: (16/20) | 6.89 s
    [Task 10/25]  Current/Best:    8.82/  20.36 GFLOPS | Progress: (20/20) | 8.42 s Done.
+
    [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 11/25]  Current/Best:   12.28/  18.11 GFLOPS | Progress: (4/20) | 3.37 s
    [Task 11/25]  Current/Best:   16.85/  18.11 GFLOPS | Progress: (8/20) | 6.17 s
    [Task 11/25]  Current/Best:   18.11/  18.11 GFLOPS | Progress: (12/20) | 8.25 s
    [Task 11/25]  Current/Best:   13.34/  21.15 GFLOPS | Progress: (16/20) | 11.14 s
    [Task 11/25]  Current/Best:   19.37/  21.57 GFLOPS | Progress: (20/20) | 13.24 s Done.
+
    [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 12/25]  Current/Best:    7.79/  18.09 GFLOPS | Progress: (4/20) | 5.78 s
    [Task 12/25]  Current/Best:    5.25/  18.09 GFLOPS | Progress: (8/20) | 9.70 s
    [Task 12/25]  Current/Best:   18.88/  18.93 GFLOPS | Progress: (12/20) | 11.70 s
    [Task 12/25]  Current/Best:   15.08/  18.93 GFLOPS | Progress: (16/20) | 14.61 s
    [Task 12/25]  Current/Best:   15.23/  19.24 GFLOPS | Progress: (20/20) | 16.53 s Done.
+
    [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 13/25]  Current/Best:    8.91/  17.35 GFLOPS | Progress: (4/20) | 3.77 s
    [Task 13/25]  Current/Best:   16.03/  20.84 GFLOPS | Progress: (8/20) | 6.37 s
    [Task 13/25]  Current/Best:   19.36/  21.60 GFLOPS | Progress: (12/20) | 9.48 s
    [Task 13/25]  Current/Best:   12.23/  21.60 GFLOPS | Progress: (16/20) | 12.99 s
    [Task 13/25]  Current/Best:   18.54/  21.60 GFLOPS | Progress: (20/20) | 15.33 s Done.
+
    [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 14/25]  Current/Best:   13.59/  13.59 GFLOPS | Progress: (4/20) | 3.47 s
    [Task 14/25]  Current/Best:    6.11/  13.59 GFLOPS | Progress: (8/20) | 5.65 s
    [Task 14/25]  Current/Best:   21.01/  21.01 GFLOPS | Progress: (12/20) | 8.31 s
    [Task 14/25]  Current/Best:   16.53/  21.01 GFLOPS | Progress: (16/20) | 9.98 s Done.
+
    [Task 14/25]  Current/Best:   17.14/  21.01 GFLOPS | Progress: (20/20) | 11.76 s
    [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 15/25]  Current/Best:   16.11/  17.63 GFLOPS | Progress: (4/20) | 2.78 s
    [Task 15/25]  Current/Best:   14.32/  17.89 GFLOPS | Progress: (8/20) | 4.14 s
    [Task 15/25]  Current/Best:   10.34/  22.22 GFLOPS | Progress: (12/20) | 6.37 s
    [Task 15/25]  Current/Best:   20.37/  22.22 GFLOPS | Progress: (16/20) | 9.48 s
    [Task 15/25]  Current/Best:    9.63/  22.22 GFLOPS | Progress: (20/20) | 10.50 s
    [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 16/25]  Current/Best:   20.61/  20.61 GFLOPS | Progress: (4/20) | 3.06 s
    [Task 16/25]  Current/Best:    3.02/  20.61 GFLOPS | Progress: (8/20) | 4.69 s
    [Task 16/25]  Current/Best:   19.58/  20.61 GFLOPS | Progress: (12/20) | 5.90 s
    [Task 16/25]  Current/Best:   17.68/  20.61 GFLOPS | Progress: (16/20) | 7.28 s
    [Task 16/25]  Current/Best:   10.01/  20.61 GFLOPS | Progress: (20/20) | 9.45 s Done.
+
    [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 17/25]  Current/Best:   13.26/  18.47 GFLOPS | Progress: (4/20) | 4.86 s
    [Task 17/25]  Current/Best:   14.39/  23.25 GFLOPS | Progress: (8/20) | 7.77 s
    [Task 17/25]  Current/Best:   17.42/  23.25 GFLOPS | Progress: (12/20) | 9.83 s
    [Task 17/25]  Current/Best:   16.54/  23.25 GFLOPS | Progress: (16/20) | 12.05 s
    [Task 17/25]  Current/Best:   10.02/  23.25 GFLOPS | Progress: (20/20) | 14.25 s Done.
+
    [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 18/25]  Current/Best:   11.34/  17.82 GFLOPS | Progress: (4/20) | 3.86 s
    [Task 18/25]  Current/Best:   10.62/  20.07 GFLOPS | Progress: (8/20) | 7.55 s
    [Task 18/25]  Current/Best:   19.16/  20.07 GFLOPS | Progress: (12/20) | 9.50 s
    [Task 18/25]  Current/Best:    9.93/  20.07 GFLOPS | Progress: (16/20) | 13.32 s
    [Task 18/25]  Current/Best:   20.73/  20.73 GFLOPS | Progress: (20/20) | 14.84 s Done.
+
    [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 19/25]  Current/Best:    6.87/  20.21 GFLOPS | Progress: (4/20) | 6.30 s
    [Task 19/25]  Current/Best:    2.60/  20.21 GFLOPS | Progress: (8/20) | 9.63 s
    [Task 19/25]  Current/Best:   19.47/  20.97 GFLOPS | Progress: (12/20) | 12.56 s
    [Task 19/25]  Current/Best:   15.30/  21.11 GFLOPS | Progress: (16/20) | 15.55 s
    [Task 19/25]  Current/Best:    2.70/  23.35 GFLOPS | Progress: (20/20) | 18.32 s Done.
+
    [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 20/25]  Current/Best:    9.98/  15.08 GFLOPS | Progress: (4/20) | 3.40 s Done.
      Done.
-
    [Task 20/25]  Current/Best:    9.65/  15.39 GFLOPS | Progress: (8/20) | 6.81 s
    [Task 20/25]  Current/Best:    2.32/  16.66 GFLOPS | Progress: (12/20) | 10.71 s
    [Task 20/25]  Current/Best:   12.40/  16.66 GFLOPS | Progress: (16/20) | 14.36 s
    [Task 20/25]  Current/Best:   11.73/  22.23 GFLOPS | Progress: (20/20) | 16.43 s
    [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 21/25]  Current/Best:    6.42/  17.75 GFLOPS | Progress: (4/20) | 3.25 s
    [Task 21/25]  Current/Best:   13.88/  17.75 GFLOPS | Progress: (8/20) | 4.84 s
    [Task 21/25]  Current/Best:    1.61/  17.75 GFLOPS | Progress: (12/20) | 6.98 s
    [Task 21/25]  Current/Best:   18.13/  18.13 GFLOPS | Progress: (16/20) | 10.49 s
    [Task 21/25]  Current/Best:    4.48/  18.13 GFLOPS | Progress: (20/20) | 17.69 s
    [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 22/25]  Current/Best:    2.71/  17.04 GFLOPS | Progress: (4/20) | 2.66 s
    [Task 22/25]  Current/Best:    8.64/  21.92 GFLOPS | Progress: (8/20) | 4.61 s
    [Task 22/25]  Current/Best:   20.14/  21.92 GFLOPS | Progress: (12/20) | 7.01 s
    [Task 22/25]  Current/Best:   14.97/  21.92 GFLOPS | Progress: (16/20) | 9.11 s
    [Task 22/25]  Current/Best:   14.38/  21.92 GFLOPS | Progress: (20/20) | 10.83 s Done.
-
    [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 23/25]  Current/Best:   17.79/  20.89 GFLOPS | Progress: (4/20) | 3.22 s
    [Task 23/25]  Current/Best:   14.46/  20.89 GFLOPS | Progress: (8/20) | 6.67 s
    [Task 23/25]  Current/Best:   21.03/  21.84 GFLOPS | Progress: (12/20) | 8.46 s
    [Task 23/25]  Current/Best:    6.52/  21.84 GFLOPS | Progress: (16/20) | 15.33 s
    [Task 23/25]  Current/Best:    7.97/  21.84 GFLOPS | Progress: (20/20) | 19.52 s Done.
-
    [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 24/25]  Current/Best:    8.52/   8.52 GFLOPS | Progress: (4/20) | 11.76 s
    [Task 24/25]  Current/Best:    3.75/   8.52 GFLOPS | Progress: (8/20) | 22.98 s
    [Task 24/25]  Current/Best:    4.44/   8.52 GFLOPS | Progress: (12/20) | 33.69 s Done.
-
    [Task 24/25]  Current/Best:    6.12/   9.01 GFLOPS | Progress: (16/20) | 39.21 s
    [Task 24/25]  Current/Best:    3.39/   9.01 GFLOPS | Progress: (20/20) | 45.13 s Done.
-
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 25/25]  Current/Best:    1.55/   2.75 GFLOPS | Progress: (4/20) | 11.58 s
    [Task 25/25]  Current/Best:    6.01/   8.38 GFLOPS | Progress: (8/20) | 22.82 s
    [Task 25/25]  Current/Best:    6.08/   8.38 GFLOPS | Progress: (12/20) | 34.09 s
    [Task 25/25]  Current/Best:    5.86/   8.99 GFLOPS | Progress: (16/20) | 35.96 s
    [Task 25/25]  Current/Best:    2.89/   9.40 GFLOPS | Progress: (20/20) | 46.62 s
+
    [Task 20/25]  Current/Best:   10.35/  15.08 GFLOPS | Progress: (8/20) | 6.82 s
    [Task 20/25]  Current/Best:    2.32/  16.56 GFLOPS | Progress: (12/20) | 10.73 s
    [Task 20/25]  Current/Best:   12.45/  16.56 GFLOPS | Progress: (16/20) | 14.63 s
    [Task 20/25]  Current/Best:   13.26/  21.96 GFLOPS | Progress: (20/20) | 16.74 s
    [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 21/25]  Current/Best:    6.39/  17.57 GFLOPS | Progress: (4/20) | 3.34 s
    [Task 21/25]  Current/Best:   14.51/  17.57 GFLOPS | Progress: (8/20) | 4.95 s
    [Task 21/25]  Current/Best:    1.61/  17.57 GFLOPS | Progress: (12/20) | 7.12 s
    [Task 21/25]  Current/Best:   18.01/  18.01 GFLOPS | Progress: (16/20) | 10.73 s
    [Task 21/25]  Current/Best:    4.46/  18.01 GFLOPS | Progress: (20/20) | 18.13 s
    [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 22/25]  Current/Best:    2.70/  16.88 GFLOPS | Progress: (4/20) | 2.75 s
    [Task 22/25]  Current/Best:    9.08/  21.54 GFLOPS | Progress: (8/20) | 4.80 s
    [Task 22/25]  Current/Best:   19.92/  21.54 GFLOPS | Progress: (12/20) | 7.19 s
    [Task 22/25]  Current/Best:   14.94/  21.54 GFLOPS | Progress: (16/20) | 9.31 s
    [Task 22/25]  Current/Best:   14.94/  21.54 GFLOPS | Progress: (20/20) | 11.05 s Done.
+
    [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 23/25]  Current/Best:   17.38/  20.26 GFLOPS | Progress: (4/20) | 3.31 s
    [Task 23/25]  Current/Best:   15.65/  20.26 GFLOPS | Progress: (8/20) | 6.70 s
    [Task 23/25]  Current/Best:   20.78/  21.35 GFLOPS | Progress: (12/20) | 8.56 s
    [Task 23/25]  Current/Best:    6.18/  21.35 GFLOPS | Progress: (16/20) | 15.66 s
    [Task 23/25]  Current/Best:    7.55/  21.35 GFLOPS | Progress: (20/20) | 19.97 s Done.
+
    [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 24/25]  Current/Best:    8.54/   8.54 GFLOPS | Progress: (4/20) | 11.89 s
    [Task 24/25]  Current/Best:    3.44/   8.54 GFLOPS | Progress: (8/20) | 23.17 s
    [Task 24/25]  Current/Best:    4.33/   8.54 GFLOPS | Progress: (12/20) | 33.90 s Done.
+
    [Task 24/25]  Current/Best:    7.36/   8.70 GFLOPS | Progress: (16/20) | 39.71 s
    [Task 24/25]  Current/Best:    3.27/   9.03 GFLOPS | Progress: (20/20) | 45.71 s Done.
+
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
    [Task 25/25]  Current/Best:    1.55/   2.93 GFLOPS | Progress: (4/20) | 11.65 s
    [Task 25/25]  Current/Best:    5.76/   8.11 GFLOPS | Progress: (8/20) | 22.97 s
    [Task 25/25]  Current/Best:    5.93/   8.11 GFLOPS | Progress: (12/20) | 34.40 s
    [Task 25/25]  Current/Best:    5.86/   9.34 GFLOPS | Progress: (16/20) | 36.28 s
    [Task 25/25]  Current/Best:    2.94/   9.34 GFLOPS | Progress: (20/20) | 46.99 s
 
 
 
@@ -748,8 +748,8 @@ improvement in comparing the optimized model to the unoptimized model.
 
  .. code-block:: none
 
-    optimized: {'mean': 415.2614426499713, 'median': 415.2566297500016, 'std': 0.5934935002330597}
-    unoptimized: {'mean': 490.9751048400267, 'median': 490.97516415004065, 'std': 0.751160590199003}
+    optimized: {'mean': 417.61745977002647, 'median': 417.75443490005273, 'std': 0.7636631550773011}
+    unoptimized: {'mean': 499.76453579998633, 'median': 499.21146484994097, 'std': 1.7238883331097248}
 
 
 
@@ -772,7 +772,7 @@ profiling/benchmarking.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 10 minutes  19.119 seconds)
+   **Total running time of the script:** ( 10 minutes  32.758 seconds)
 
 
 .. _sphx_glr_download_tutorial_autotvm_relay_x86.py:
diff --git a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
index f2ac1cde9..c76d39f4d 100644
--- a/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
+++ b/docs/_sources/tutorial/cross_compilation_and_rpc.rst.txt
@@ -282,7 +282,7 @@ device and returns the measured cost. Network overhead is excluded.
 
  .. code-block:: none
 
-    1.259e-07 secs/op
+    1.256e-07 secs/op
 
 
 
diff --git a/docs/_sources/tutorial/intro_topi.rst.txt b/docs/_sources/tutorial/intro_topi.rst.txt
index 807661275..4194bf4a1 100644
--- a/docs/_sources/tutorial/intro_topi.rst.txt
+++ b/docs/_sources/tutorial/intro_topi.rst.txt
@@ -263,7 +263,7 @@ As you can see, scheduled stages of computation have been accumulated and we can
 
  .. code-block:: none
 
-    [stage(a, placeholder(a, 0x201e1120)), stage(b, placeholder(b, 0x201eec20)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(mi [...]
+    [stage(a, placeholder(a, 0xbdc5a40)), stage(b, placeholder(b, 0x19afbd60)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min [...]
 
 
 
diff --git a/docs/_sources/tutorial/sg_execution_times.rst.txt b/docs/_sources/tutorial/sg_execution_times.rst.txt
index d38e89b90..7904cfe84 100644
--- a/docs/_sources/tutorial/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorial/sg_execution_times.rst.txt
@@ -5,30 +5,30 @@
 
 Computation times
 =================
-**13:34.121** total execution time for **tutorial** files:
+**13:30.966** total execution time for **tutorial** files:
 
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)                 | 10:19.119 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_autotvm_relay_x86.py` (``autotvm_relay_x86.py``)                 | 10:32.758 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``) | 01:19.752 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_auto_scheduler_matmul_x86.py` (``auto_scheduler_matmul_x86.py``) | 01:02.489 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)     | 00:59.336 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)     | 00:59.931 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)                 | 00:29.731 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_relay_quick_start.py` (``relay_quick_start.py``)                 | 00:30.726 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)               | 00:24.637 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_autotvm_matmul_x86.py` (``autotvm_matmul_x86.py``)               | 00:23.659 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)                               | 00:00.693 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_intro_topi.py` (``intro_topi.py``)                               | 00:00.713 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)       | 00:00.689 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_tensor_ir_blitz_course.py` (``tensor_ir_blitz_course.py``)       | 00:00.519 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``) | 00:00.157 | 0.0 MB |
+| :ref:`sphx_glr_tutorial_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``) | 00:00.164 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorial_introduction.py` (``introduction.py``)                           | 00:00.005 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)   | 00:00.001 | 0.0 MB |
++------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorial_tvmc_python.py` (``tvmc_python.py``)                             | 00:00.001 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorial_install.py` (``install.py``)                                     | 00:00.001 | 0.0 MB |
 +------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorial_tvmc_command_line_driver.py` (``tvmc_command_line_driver.py``)   | 00:00.001 | 0.0 MB |
-+------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
index 828055eda..c2f620138 100644
--- a/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
+++ b/docs/_sources/tutorial/tensor_expr_get_started.rst.txt
@@ -301,8 +301,8 @@ helper function to run a profile of the TVM generated code.
 
  .. code-block:: none
 
-    Numpy running time: 0.000008
-    naive: 0.000006
+    Numpy running time: 0.000007
+    naive: 0.000007
 
 
 
@@ -460,7 +460,7 @@ factor to be the number of threads on your CPU.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    vector: 0.000025
+    vector: 0.000026
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type="auto"),
@@ -512,10 +512,10 @@ We can now compare the different schedules
  .. code-block:: none
 
                 Operator                  Timing             Performance
-                   numpy    8.126039992930601e-06                    1.0
-                   naive              5.9734e-06      0.7350936009663588
-                parallel              6.9913e-06      0.8603575672876593
-                  vector              2.4531e-05      3.0188135944865144
+                   numpy    7.137789998523658e-06                    1.0
+                   naive              6.6301e-06      0.9288729426575081
+                parallel    6.8720000000000004e-06    0.9627629842600256
+                  vector    2.5590000000000004e-05    3.5851433013990186
 
 
 
@@ -936,7 +936,7 @@ matrix multiplication.
 
  .. code-block:: none
 
-    Numpy running time: 0.017564
+    Numpy running time: 0.018889
 
 
 
@@ -996,7 +996,7 @@ optimizations.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    none: 3.298072
+    none: 3.289132
 
 
 
@@ -1101,7 +1101,7 @@ schedule.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    blocking: 0.307013
+    blocking: 0.329313
 
 
 
@@ -1199,7 +1199,7 @@ already cache friendly from our previous optimizations.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    vectorization: 0.340419
+    vectorization: 0.358022
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1275,7 +1275,7 @@ more cache friendly.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    loop permutation: 0.111652
+    loop permutation: 0.116325
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1376,7 +1376,7 @@ optimized schedule.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    array packing: 0.108105
+    array packing: 0.109678
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1471,7 +1471,7 @@ to `C` when all the block results are ready.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    block caching: 0.110703
+    block caching: 0.110596
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1559,7 +1559,7 @@ of thread-level parallelization.
 
     /workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
       "target_host parameter is going to be deprecated. "
-    parallelization: 0.144381
+    parallelization: 0.144771
     @main = primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1640,13 +1640,13 @@ working, we can compare the results.
  .. code-block:: none
 
                 Operator                  Timing             Performance
-                    none            3.2980724726                     1.0
-                blocking            0.3070129586     0.09308860285837492
-           vectorization            0.3404186675      0.1032174612074654
-        loop permutation     0.11165196200000001    0.033853701799336264
-           array packing     0.10810546380000001     0.03277837727888867
-           block caching     0.11070322930000001     0.03356603901815666
-         parallelization     0.14438079519999997     0.04377732642308461
+                    none            3.2891317263                     1.0
+                blocking            0.3293131147     0.10012159502971622
+           vectorization            0.3580215504     0.10884986683179895
+        loop permutation            0.1163250125    0.035366480329705734
+           array packing     0.10967751880000001    0.033345432146427934
+           block caching     0.11059642839999999     0.03362480970757951
+         parallelization            0.1447713667    0.044015071072527626
 
 
 
diff --git a/docs/commit_hash b/docs/commit_hash
index e3ec3686e..8cb275138 100644
--- a/docs/commit_hash
+++ b/docs/commit_hash
@@ -1 +1 @@
-3c737fbd5baccc60aff355b40105220c148b7d7f
+ebbce649f08dca6a870e7845febc39125b240001
diff --git a/docs/how_to/compile_models/from_darknet.html b/docs/how_to/compile_models/from_darknet.html
index 7abebae8f..06d581eb0 100644
--- a/docs/how_to/compile_models/from_darknet.html
+++ b/docs/how_to/compile_models/from_darknet.html
@@ -574,7 +574,7 @@ class:[&#39;truck 0.9266&#39;] left:471 top:83 right:689 bottom:169
 class:[&#39;bicycle 0.9984&#39;] left:111 top:113 right:577 bottom:447
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  2.779 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  4.745 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-darknet-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_darknet.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/from_mxnet.html b/docs/how_to/compile_models/from_mxnet.html
index 20fdc8189..86aa7142a 100644
--- a/docs/how_to/compile_models/from_mxnet.html
+++ b/docs/how_to/compile_models/from_mxnet.html
@@ -427,7 +427,7 @@ to download the full example code</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;x&quot;</span><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#tuple" title="builtins.tuple" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">x</span><span class="o">.</span><span class="n">shape</span></a><span class="p">)</span>
 </pre></div>
 </div>
-<img src="../../_images/sphx_glr_from_mxnet_001.png" srcset="../../_images/sphx_glr_from_mxnet_001.png" alt="from mxnet" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipfcddb97c-cd25-48a0-9e13-dae7422c5359 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+<img src="../../_images/sphx_glr_from_mxnet_001.png" srcset="../../_images/sphx_glr_from_mxnet_001.png" alt="from mxnet" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip2c0a41eb-8ea5-4907-8fef-fd0b4319dbfd from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
 x (1, 3, 224, 224)
 </pre></div>
 </div>
diff --git a/docs/how_to/compile_models/from_oneflow.html b/docs/how_to/compile_models/from_oneflow.html
index 9c961d3dc..7e83d5e21 100644
--- a/docs/how_to/compile_models/from_oneflow.html
+++ b/docs/how_to/compile_models/from_oneflow.html
@@ -432,13 +432,14 @@ python3 -m pip install -f https://release.oneflow.info <span class="nv">oneflow<
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip&quot; to /workspace/.oneflow/flowvision_cache/resnet18.zip
 
   0%|          | 0.00/41.5M [00:00&lt;?, ?B/s]
- 15%|#5        | 6.33M/41.5M [00:00&lt;00:00, 40.9MB/s]
- 25%|##4       | 10.2M/41.5M [00:00&lt;00:01, 28.7MB/s]
- 39%|###8      | 16.0M/41.5M [00:00&lt;00:01, 26.5MB/s]
- 58%|#####7    | 24.0M/41.5M [00:00&lt;00:00, 30.8MB/s]
- 77%|#######7  | 32.0M/41.5M [00:00&lt;00:00, 41.4MB/s]
- 96%|#########6| 40.0M/41.5M [00:01&lt;00:00, 46.5MB/s]
-100%|##########| 41.5M/41.5M [00:01&lt;00:00, 40.3MB/s]
+ 15%|#5        | 6.33M/41.5M [00:00&lt;00:00, 41.9MB/s]
+ 25%|##4       | 10.3M/41.5M [00:00&lt;00:01, 32.1MB/s]
+ 35%|###4      | 14.3M/41.5M [00:00&lt;00:01, 26.2MB/s]
+ 42%|####2     | 17.5M/41.5M [00:00&lt;00:00, 27.9MB/s]
+ 58%|#####7    | 24.0M/41.5M [00:00&lt;00:00, 32.5MB/s]
+ 77%|#######7  | 32.0M/41.5M [00:00&lt;00:00, 41.0MB/s]
+ 92%|#########2| 38.3M/41.5M [00:01&lt;00:00, 42.7MB/s]
+100%|##########| 41.5M/41.5M [00:01&lt;00:00, 37.6MB/s]
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/compile_models/from_pytorch.html b/docs/how_to/compile_models/from_pytorch.html
index e2037721d..db3c373b9 100644
--- a/docs/how_to/compile_models/from_pytorch.html
+++ b/docs/how_to/compile_models/from_pytorch.html
@@ -414,9 +414,9 @@ be unstable.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/resnet18-f37072fd.pth&quot; to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
 
   0%|          | 0.00/44.7M [00:00&lt;?, ?B/s]
- 37%|###7      | 16.6M/44.7M [00:00&lt;00:00, 174MB/s]
- 74%|#######4  | 33.3M/44.7M [00:00&lt;00:00, 174MB/s]
-100%|##########| 44.7M/44.7M [00:00&lt;00:00, 187MB/s]
+ 44%|####4     | 19.7M/44.7M [00:00&lt;00:00, 206MB/s]
+ 96%|#########6| 43.1M/44.7M [00:00&lt;00:00, 229MB/s]
+100%|##########| 44.7M/44.7M [00:00&lt;00:00, 227MB/s]
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/compile_models/from_tensorflow.html b/docs/how_to/compile_models/from_tensorflow.html
index 40459affc..bd8c343f9 100644
--- a/docs/how_to/compile_models/from_tensorflow.html
+++ b/docs/how_to/compile_models/from_tensorflow.html
@@ -636,7 +636,7 @@ banana (score = 0.00022)
 desk (score = 0.00019)
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  4.223 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  4.067 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-compile-models-from-tensorflow-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">from_tensorflow.py</span></code></a></p>
diff --git a/docs/how_to/compile_models/sg_execution_times.html b/docs/how_to/compile_models/sg_execution_times.html
index f383a529b..32c938c69 100644
--- a/docs/how_to/compile_models/sg_execution_times.html
+++ b/docs/how_to/compile_models/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-compile-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>04:58.294</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
+<p><strong>05:10.189</strong> total execution time for <strong>how_to_compile_models</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 81%" />
@@ -335,44 +335,44 @@
 <col style="width: 8%" />
 </colgroup>
 <tbody>
-<tr class="row-odd"><td><p><a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></td>
-<td><p>01:04.223</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></td>
+<td><p>01:04.745</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="from_darknet.html#sphx-glr-how-to-compile-models-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></td>
-<td><p>01:02.779</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="from_tensorflow.html#sphx-glr-how-to-compile-models-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></td>
+<td><p>01:04.067</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="from_paddle.html#sphx-glr-how-to-compile-models-from-paddle-py"><span class="std std-ref">Compile PaddlePaddle Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_paddle.py</span></code>)</p></td>
-<td><p>00:37.794</p></td>
+<td><p>00:40.949</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="from_oneflow.html#sphx-glr-how-to-compile-models-from-oneflow-py"><span class="std std-ref">Compile OneFlow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_oneflow.py</span></code>)</p></td>
-<td><p>00:26.951</p></td>
+<td><p>00:29.185</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></td>
-<td><p>00:25.243</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></td>
+<td><p>00:25.321</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="from_mxnet.html#sphx-glr-how-to-compile-models-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></td>
-<td><p>00:24.041</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="from_tflite.html#sphx-glr-how-to-compile-models-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></td>
+<td><p>00:24.691</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="from_coreml.html#sphx-glr-how-to-compile-models-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></td>
-<td><p>00:21.237</p></td>
+<td><p>00:22.786</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="from_pytorch.html#sphx-glr-how-to-compile-models-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></td>
-<td><p>00:19.060</p></td>
+<td><p>00:20.700</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="from_keras.html#sphx-glr-how-to-compile-models-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></td>
-<td><p>00:14.560</p></td>
+<td><p>00:15.335</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="from_onnx.html#sphx-glr-how-to-compile-models-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></td>
-<td><p>00:02.406</p></td>
+<td><p>00:02.409</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/deploy_models/deploy_model_on_android.html b/docs/how_to/deploy_models/deploy_model_on_android.html
index a6559c30c..1b3b65c6a 100644
--- a/docs/how_to/deploy_models/deploy_model_on_android.html
+++ b/docs/how_to/deploy_models/deploy_model_on_android.html
@@ -653,7 +653,7 @@ to the remote android device.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  15.4404      15.4349      15.4982      15.3954       0.0273
+  16.5195      16.4907      17.1171      15.9516       0.4699
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
index 62d7fac33..684be22ab 100644
--- a/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
+++ b/docs/how_to/deploy_models/deploy_object_detection_pytorch.html
@@ -436,13 +436,42 @@ be unstable.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth&quot; to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
 
   0%|          | 0.00/170M [00:00&lt;?, ?B/s]
- 12%|#1        | 20.0M/170M [00:00&lt;00:00, 210MB/s]
- 27%|##7       | 45.9M/170M [00:00&lt;00:00, 246MB/s]
- 41%|####      | 69.4M/170M [00:00&lt;00:00, 236MB/s]
- 56%|#####6    | 95.4M/170M [00:00&lt;00:00, 250MB/s]
- 71%|#######1  | 121M/170M [00:00&lt;00:00, 256MB/s]
- 87%|########6 | 147M/170M [00:00&lt;00:00, 262MB/s]
-100%|##########| 170M/170M [00:00&lt;00:00, 255MB/s]
+  3%|2         | 4.44M/170M [00:00&lt;00:03, 46.2MB/s]
+  5%|5         | 8.84M/170M [00:00&lt;00:04, 35.8MB/s]
+  8%|8         | 13.6M/170M [00:00&lt;00:03, 41.3MB/s]
+ 10%|#         | 17.7M/170M [00:00&lt;00:05, 31.8MB/s]
+ 12%|#2        | 21.1M/170M [00:00&lt;00:05, 29.3MB/s]
+ 14%|#4        | 24.1M/170M [00:00&lt;00:06, 23.3MB/s]
+ 16%|#6        | 27.8M/170M [00:00&lt;00:05, 26.7MB/s]
+ 18%|#8        | 30.9M/170M [00:01&lt;00:05, 28.2MB/s]
+ 20%|#9        | 33.8M/170M [00:01&lt;00:05, 26.9MB/s]
+ 22%|##2       | 37.7M/170M [00:01&lt;00:04, 30.5MB/s]
+ 25%|##4       | 42.3M/170M [00:01&lt;00:03, 35.2MB/s]
+ 29%|##8       | 48.5M/170M [00:01&lt;00:02, 42.9MB/s]
+ 32%|###1      | 54.3M/170M [00:01&lt;00:02, 47.9MB/s]
+ 35%|###4      | 59.2M/170M [00:01&lt;00:02, 49.0MB/s]
+ 38%|###7      | 64.0M/170M [00:02&lt;00:03, 32.5MB/s]
+ 40%|####      | 68.0M/170M [00:02&lt;00:03, 33.7MB/s]
+ 43%|####2     | 72.2M/170M [00:02&lt;00:02, 36.1MB/s]
+ 45%|####5     | 76.5M/170M [00:02&lt;00:02, 38.3MB/s]
+ 48%|####7     | 81.2M/170M [00:02&lt;00:02, 41.1MB/s]
+ 50%|#####     | 85.4M/170M [00:02&lt;00:02, 35.1MB/s]
+ 53%|#####2    | 89.4M/170M [00:02&lt;00:02, 36.5MB/s]
+ 56%|#####6    | 95.4M/170M [00:02&lt;00:01, 43.5MB/s]
+ 59%|#####8    | 99.9M/170M [00:03&lt;00:02, 26.2MB/s]
+ 62%|######2   | 105M/170M [00:03&lt;00:02, 32.0MB/s]
+ 65%|######5   | 111M/170M [00:03&lt;00:01, 36.7MB/s]
+ 69%|######9   | 117M/170M [00:03&lt;00:01, 44.5MB/s]
+ 72%|#######2  | 122M/170M [00:03&lt;00:01, 46.6MB/s]
+ 76%|#######5  | 129M/170M [00:03&lt;00:00, 51.3MB/s]
+ 79%|#######8  | 134M/170M [00:03&lt;00:00, 45.8MB/s]
+ 82%|########1 | 139M/170M [00:04&lt;00:00, 34.9MB/s]
+ 84%|########4 | 143M/170M [00:04&lt;00:00, 37.1MB/s]
+ 87%|########7 | 148M/170M [00:04&lt;00:00, 41.0MB/s]
+ 90%|######### | 154M/170M [00:04&lt;00:00, 44.4MB/s]
+ 94%|#########3| 159M/170M [00:04&lt;00:00, 48.0MB/s]
+ 97%|#########6| 164M/170M [00:04&lt;00:00, 48.9MB/s]
+100%|##########| 170M/170M [00:04&lt;00:00, 38.4MB/s]
 /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
   for i in range(dim)
 /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the &#39;trunc&#39; function NOT &#39;floor&#39;). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode=&#39;trunc&#39;), or for actual floor division, use torch.div(a, b, rounding_mode=&#39;floor&#39;).
@@ -537,7 +566,7 @@ torchvision rcnn models.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Get 9 valid boxes
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  49.764 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 3 minutes  8.564 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-object-detection-pytorch-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_object_detection_pytorch.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_prequantized.html b/docs/how_to/deploy_models/deploy_prequantized.html
index 3d8537bd8..86b0aadb4 100644
--- a/docs/how_to/deploy_models/deploy_prequantized.html
+++ b/docs/how_to/deploy_models/deploy_prequantized.html
@@ -480,7 +480,9 @@ training. Other models require a full post training calibration.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading: &quot;https://download.pytorch.org/models/mobilenet_v2-b0353104.pth&quot; to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
 
   0%|          | 0.00/13.6M [00:00&lt;?, ?B/s]
-100%|##########| 13.6M/13.6M [00:00&lt;00:00, 158MB/s]
+ 32%|###2      | 4.37M/13.6M [00:00&lt;00:00, 45.6MB/s]
+ 64%|######4   | 8.72M/13.6M [00:00&lt;00:00, 41.9MB/s]
+100%|##########| 13.6M/13.6M [00:00&lt;00:00, 59.2MB/s]
 </pre></div>
 </div>
 </div>
@@ -569,7 +571,7 @@ output values are identical out of 1000 outputs from mobilenet v2.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  90.1732      90.1488      90.6934      89.7875       0.1924
+  90.4874      90.4642      90.9767      90.1628       0.2075
 </pre></div>
 </div>
 <div class="admonition note">
@@ -608,7 +610,7 @@ This includes support for the VNNI 8 bit dot product instruction (CascadeLake or
 <div class="section" id="deploy-a-quantized-tflite-model">
 <h2>Deploy a quantized TFLite Model<a class="headerlink" href="#deploy-a-quantized-tflite-model" title="Permalink to this headline">¶</a></h2>
 <p>TODO</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  6.625 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  10.849 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_prequantized_tflite.html b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
index b66688221..b5a34f9ff 100644
--- a/docs/how_to/deploy_models/deploy_prequantized_tflite.html
+++ b/docs/how_to/deploy_models/deploy_prequantized_tflite.html
@@ -573,7 +573,7 @@ TFLite Top-5 labels: [387 102 386 341 349]
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  116.9648     116.7765     126.4263     115.8638      1.2465
+  120.1767     120.1294     121.7392     119.6096      0.3312
 </pre></div>
 </div>
 <div class="admonition note">
@@ -601,7 +601,7 @@ network for ARM CPU</span></a>.</p></li>
 </ul>
 </div></blockquote>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  58.070 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  59.496 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-prequantized-tflite-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized_tflite.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_quantized.html b/docs/how_to/deploy_models/deploy_quantized.html
index 1a25f2397..514feb12b 100644
--- a/docs/how_to/deploy_models/deploy_quantized.html
+++ b/docs/how_to/deploy_models/deploy_quantized.html
@@ -509,7 +509,7 @@ for calibration. But the accuracy might be impacted.</p>
   DeprecationWarning,
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  32.420 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  46.615 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-quantized-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_quantized.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
index 6e608f2f3..9ffb9a86e 100644
--- a/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
+++ b/docs/how_to/deploy_models/deploy_ssd_gluoncv.html
@@ -441,23 +441,23 @@ to your device.</p>
 Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
 
   0%|          | 0/132723 [00:00&lt;?, ?KB/s]
-  3%|3         | 4190/132723 [00:00&lt;00:03, 41834.71KB/s]
-  9%|9         | 11988/132723 [00:00&lt;00:01, 63079.75KB/s]
- 14%|#3        | 18298/132723 [00:00&lt;00:02, 42188.68KB/s]
- 21%|##        | 27374/132723 [00:00&lt;00:01, 56935.47KB/s]
- 26%|##6       | 34595/132723 [00:00&lt;00:01, 61555.74KB/s]
- 33%|###2      | 43748/132723 [00:00&lt;00:01, 70599.41KB/s]
- 39%|###8      | 51319/132723 [00:00&lt;00:01, 66518.64KB/s]
- 45%|####5     | 60348/132723 [00:00&lt;00:00, 73269.43KB/s]
- 51%|#####1    | 68014/132723 [00:01&lt;00:01, 64089.00KB/s]
- 58%|#####8    | 77195/132723 [00:01&lt;00:00, 71361.94KB/s]
- 64%|######3   | 84741/132723 [00:01&lt;00:00, 61955.22KB/s]
- 71%|#######   | 93881/132723 [00:01&lt;00:00, 69309.12KB/s]
- 76%|#######6  | 101295/132723 [00:01&lt;00:00, 63552.54KB/s]
- 83%|########3 | 110495/132723 [00:01&lt;00:00, 70742.83KB/s]
- 89%|########8 | 117985/132723 [00:01&lt;00:00, 67670.14KB/s]
- 96%|#########5| 126937/132723 [00:01&lt;00:00, 73416.29KB/s]
-100%|##########| 132723/132723 [00:01&lt;00:00, 66870.19KB/s]
+  5%|4         | 6414/132723 [00:00&lt;00:01, 64136.24KB/s]
+ 11%|#1        | 14697/132723 [00:00&lt;00:01, 75127.29KB/s]
+ 17%|#6        | 22210/132723 [00:00&lt;00:01, 62992.49KB/s]
+ 22%|##2       | 29748/132723 [00:00&lt;00:01, 67369.88KB/s]
+ 28%|##7       | 36655/132723 [00:00&lt;00:01, 60766.91KB/s]
+ 34%|###3      | 44871/132723 [00:00&lt;00:01, 67185.44KB/s]
+ 40%|####      | 53219/132723 [00:00&lt;00:01, 72071.74KB/s]
+ 46%|####6     | 61467/132723 [00:00&lt;00:00, 75191.29KB/s]
+ 53%|#####2    | 69848/132723 [00:00&lt;00:00, 77773.84KB/s]
+ 59%|#####8    | 77735/132723 [00:01&lt;00:00, 77474.75KB/s]
+ 65%|######4   | 86172/132723 [00:01&lt;00:00, 79529.54KB/s]
+ 71%|#######1  | 94542/132723 [00:01&lt;00:00, 80773.09KB/s]
+ 77%|#######7  | 102662/132723 [00:01&lt;00:00, 58593.07KB/s]
+ 84%|########3 | 110973/132723 [00:01&lt;00:00, 64386.46KB/s]
+ 89%|########9 | 118189/132723 [00:01&lt;00:00, 64512.58KB/s]
+ 95%|#########5| 126535/132723 [00:01&lt;00:00, 69429.16KB/s]
+100%|##########| 132723/132723 [00:01&lt;00:00, 70072.68KB/s]
 </pre></div>
 </div>
 <p>Create TVM runtime and do inference
@@ -500,7 +500,7 @@ Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from h
 <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
 </pre></div>
 </div>
-<img src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" srcset="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" alt="deploy ssd gluoncv" class = "sphx-glr-single-img"/><p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  28.615 seconds)</p>
+<img src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" srcset="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" alt="deploy ssd gluoncv" class = "sphx-glr-single-img"/><p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  37.180 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-deploy-models-deploy-ssd-gluoncv-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_ssd_gluoncv.py</span></code></a></p>
diff --git a/docs/how_to/deploy_models/sg_execution_times.html b/docs/how_to/deploy_models/sg_execution_times.html
index 454432855..8ea638d19 100644
--- a/docs/how_to/deploy_models/sg_execution_times.html
+++ b/docs/how_to/deploy_models/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-deploy-models-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>10:46.088</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
+<p><strong>11:36.686</strong> total execution time for <strong>how_to_deploy_models</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 86%" />
@@ -336,31 +336,31 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-how-to-deploy-models-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></td>
-<td><p>02:49.764</p></td>
+<td><p>03:08.564</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-how-to-deploy-models-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></td>
-<td><p>02:28.615</p></td>
+<td><p>02:37.180</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-how-to-deploy-models-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></td>
-<td><p>01:58.070</p></td>
+<td><p>01:59.496</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_quantized.html#sphx-glr-how-to-deploy-models-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></td>
-<td><p>01:32.420</p></td>
+<td><p>01:46.615</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_prequantized.html#sphx-glr-how-to-deploy-models-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></td>
-<td><p>01:06.625</p></td>
+<td><p>01:10.849</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_model_on_android.html#sphx-glr-how-to-deploy-models-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></td>
-<td><p>00:28.550</p></td>
+<td><p>00:30.609</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-how-to-deploy-models-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></td>
-<td><p>00:22.039</p></td>
+<td><p>00:23.367</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_sparse.html#sphx-glr-how-to-deploy-models-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></td>
diff --git a/docs/how_to/extend_tvm/bring_your_own_datatypes.html b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
index 3dd70e04f..d335f4fee 100644
--- a/docs/how_to/extend_tvm/bring_your_own_datatypes.html
+++ b/docs/how_to/extend_tvm/bring_your_own_datatypes.html
@@ -612,7 +612,7 @@ In this alpha state of the Bring Your Own Datatypes framework, we have not imple
 <span class="n">module</span><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">params</span></a> <span class="o">=</span> <span class="n">get_mobilenet</span><span class="p">()</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zipf12df131-f718-4783-9b6f-352ac0f95f20 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip01c920af-8621-4791-bb1d-d3f16afd15d3 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 </pre></div>
 </div>
 <p>It’s easy to execute MobileNet with native TVM:</p>
@@ -676,7 +676,7 @@ In this alpha state of the Bring Your Own Datatypes framework, we have not imple
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-  Check failed: (lower) is false: Intrinsic lowering function for target llvm, intrinsic name tir.sqrt, type 150 not found
+  Check failed: (lower) is false: FloatImm lowering function for target llvm type 150 not found
 </pre></div>
 </div>
 <p>When we attempt to run the model, we get a familiar error telling us that more functions need to be registered for myfloat.</p>
diff --git a/docs/how_to/extend_tvm/sg_execution_times.html b/docs/how_to/extend_tvm/sg_execution_times.html
index 3418979d3..bfa726d66 100644
--- a/docs/how_to/extend_tvm/sg_execution_times.html
+++ b/docs/how_to/extend_tvm/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-extend-tvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:39.665</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
+<p><strong>00:42.211</strong> total execution time for <strong>how_to_extend_tvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -336,15 +336,15 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="bring_your_own_datatypes.html#sphx-glr-how-to-extend-tvm-bring-your-own-datatypes-py"><span class="std std-ref">Bring Your Own Datatypes to TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">bring_your_own_datatypes.py</span></code>)</p></td>
-<td><p>00:36.574</p></td>
+<td><p>00:38.909</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="use_pass_instrument.html#sphx-glr-how-to-extend-tvm-use-pass-instrument-py"><span class="std std-ref">How to Use TVM Pass Instrument</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_instrument.py</span></code>)</p></td>
-<td><p>00:02.184</p></td>
+<td><p>00:02.316</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="use_pass_infra.html#sphx-glr-how-to-extend-tvm-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></td>
-<td><p>00:00.900</p></td>
+<td><p>00:00.978</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="low_level_custom_pass.html#sphx-glr-how-to-extend-tvm-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></td>
diff --git a/docs/how_to/extend_tvm/use_pass_instrument.html b/docs/how_to/extend_tvm/use_pass_instrument.html
index 8a2640ffa..e4fdf76a7 100644
--- a/docs/how_to/extend_tvm/use_pass_instrument.html
+++ b/docs/how_to/extend_tvm/use_pass_instrument.html
@@ -512,10 +512,10 @@ profile the execution time of each passes.</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 8314us [8314us] (44.15%; 44.15%)
-FoldScaleAxis: 10517us [13us] (55.85%; 55.85%)
-        FoldConstant: 10504us [2209us] (55.78%; 99.88%)
-                InferType: 8296us [8296us] (44.05%; 78.97%)
+InferType: 6732us [6732us] (45.68%; 45.68%)
+FoldScaleAxis: 8004us [7us] (54.32%; 54.32%)
+        FoldConstant: 7997us [1641us] (54.27%; 99.91%)
+                InferType: 6356us [6356us] (43.13%; 79.48%)
 </pre></div>
 </div>
 </div>
@@ -537,10 +537,10 @@ Refer to following sections and <a class="reference internal" href="../../refere
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Printing results of timing profile...
-InferType: 7433us [7433us] (49.04%; 49.04%)
-FoldScaleAxis: 7724us [4us] (50.96%; 50.96%)
-        FoldConstant: 7720us [1618us] (50.93%; 99.94%)
-                InferType: 6102us [6102us] (40.26%; 79.04%)
+InferType: 6545us [6545us] (43.81%; 43.81%)
+FoldScaleAxis: 8397us [7us] (56.19%; 56.19%)
+        FoldConstant: 8389us [1687us] (56.15%; 99.91%)
+                InferType: 6703us [6703us] (44.86%; 79.89%)
 </pre></div>
 </div>
 <p>Register empty list to clear existing instruments.</p>
diff --git a/docs/how_to/optimize_operators/opt_conv_cuda.html b/docs/how_to/optimize_operators/opt_conv_cuda.html
index f8e27aeea..d1d68924f 100644
--- a/docs/how_to/optimize_operators/opt_conv_cuda.html
+++ b/docs/how_to/optimize_operators/opt_conv_cuda.html
@@ -564,7 +564,7 @@ latency of convolution.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Convolution: </span><span class="si">%f</span><span class="s2"> ms&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span> <span class="o">*</span> <span cl [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 47.030861 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Convolution: 34.195594 ms
 </pre></div>
 </div>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-optimize-operators-opt-conv-cuda-py">
diff --git a/docs/how_to/optimize_operators/opt_conv_tensorcore.html b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
index 97acb8c04..cfa81c80b 100644
--- a/docs/how_to/optimize_operators/opt_conv_tensorcore.html
+++ b/docs/how_to/optimize_operators/opt_conv_tensorcore.html
@@ -906,7 +906,7 @@ be able to run on our build server</p>
     <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;conv2d with tensor core: </span><span class="si">%f</span><span class="s2"> ms&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span> <span class="o">* [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 10.383821 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>conv2d with tensor core: 13.380213 ms
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/optimize_operators/opt_gemm.html b/docs/how_to/optimize_operators/opt_gemm.html
index bb459a9d2..de57d0cb4 100644
--- a/docs/how_to/optimize_operators/opt_gemm.html
+++ b/docs/how_to/optimize_operators/opt_gemm.html
@@ -461,8 +461,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Baseline: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.017734
-Baseline: 3.304103
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.019290
+Baseline: 3.314951
 </pre></div>
 </div>
 <p>In TVM, we can always inspect lower level IR to debug or optimize our schedule.
@@ -522,7 +522,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt1: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.298228
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt1: 0.319761
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
@@ -589,7 +589,7 @@ vastly.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt2: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.333569
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt2: 0.357257
 </pre></div>
 </div>
 <p>Here is the generated IR after vectorization.</p>
@@ -650,7 +650,7 @@ the access pattern for A matrix is more cache friendly.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt3: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.112827
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt3: 0.121127
 </pre></div>
 </div>
 <p>Here is the generated IR after loop permutation.</p>
@@ -733,7 +733,7 @@ flattening.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt4: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.110056
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt4: 0.112509
 </pre></div>
 </div>
 <p>Here is the generated IR after array packing.</p>
@@ -819,7 +819,7 @@ write to C when all the block results are ready.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt5: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.110797
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt5: 0.112328
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
@@ -909,7 +909,7 @@ write to C when all the block results are ready.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Opt6: </span><span class="si">%f</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">opt6_time</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.144770
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Opt6: 0.145362
 </pre></div>
 </div>
 <p>Here is the generated IR after parallelization.</p>
diff --git a/docs/how_to/optimize_operators/sg_execution_times.html b/docs/how_to/optimize_operators/sg_execution_times.html
index 888a2dd22..820aa1e86 100644
--- a/docs/how_to/optimize_operators/sg_execution_times.html
+++ b/docs/how_to/optimize_operators/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-optimize-operators-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:33.720</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
+<p><strong>00:35.039</strong> total execution time for <strong>how_to_optimize_operators</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -336,15 +336,15 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="opt_gemm.html#sphx-glr-how-to-optimize-operators-opt-gemm-py"><span class="std std-ref">How to optimize GEMM on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_gemm.py</span></code>)</p></td>
-<td><p>00:31.640</p></td>
+<td><p>00:32.541</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="opt_conv_tensorcore.html#sphx-glr-how-to-optimize-operators-opt-conv-tensorcore-py"><span class="std std-ref">How to optimize convolution using TensorCores</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_tensorcore.py</span></code>)</p></td>
-<td><p>00:01.183</p></td>
+<td><p>00:01.390</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="opt_conv_cuda.html#sphx-glr-how-to-optimize-operators-opt-conv-cuda-py"><span class="std std-ref">How to optimize convolution on GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">opt_conv_cuda.py</span></code>)</p></td>
-<td><p>00:00.896</p></td>
+<td><p>00:01.107</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
index d2523cdb2..d1afb43b3 100644
--- a/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
+++ b/docs/how_to/tune_with_autoscheduler/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-tune-with-autoscheduler-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>05:53.101</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
+<p><strong>06:05.144</strong> total execution time for <strong>how_to_tune_with_autoscheduler</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 85%" />
@@ -336,27 +336,27 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a Convolution Layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></td>
-<td><p>03:10.941</p></td>
+<td><p>03:16.712</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_network_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-x86-py"><span class="std std-ref">Auto-scheduling a Neural Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_x86.py</span></code>)</p></td>
-<td><p>01:20.822</p></td>
+<td><p>01:23.975</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_network_cuda.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-cuda-py"><span class="std std-ref">Auto-scheduling a Neural Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_cuda.py</span></code>)</p></td>
-<td><p>00:45.275</p></td>
+<td><p>00:46.934</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_sparse_x86.html#sphx-glr-how-to-tune-with-autoscheduler-tune-sparse-x86-py"><span class="std std-ref">Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_sparse_x86.py</span></code>)</p></td>
-<td><p>00:19.106</p></td>
+<td><p>00:19.548</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_network_mali.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-mali-py"><span class="std std-ref">Auto-scheduling a Neural Network for mali GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_mali.py</span></code>)</p></td>
-<td><p>00:08.551</p></td>
+<td><p>00:09.047</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_network_arm.html#sphx-glr-how-to-tune-with-autoscheduler-tune-network-arm-py"><span class="std std-ref">Auto-scheduling a Neural Network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_network_arm.py</span></code>)</p></td>
-<td><p>00:08.406</p></td>
+<td><p>00:08.929</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
index d048d65db..6abf4c4dd 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.html
@@ -491,65 +491,180 @@ cooperative fetching, unrolling and operator fusion.</p>
              compute: Buffer(compute_2: Pointer(float32), float32, [25088], [])}
   buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute}
   preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
-  attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 32;
-  allocate(conv2d_nchw: Pointer(local float32), float32, [2]), storage_scope = local;
+  attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 16;
+  allocate(conv2d_nchw: Pointer(local float32), float32, [4]), storage_scope = local;
   allocate(pad_temp.shared: Pointer(shared float32), float32, [324]), storage_scope = shared;
-  allocate(kernel.shared: Pointer(shared float32), float32, [576]), storage_scope = shared;
+  allocate(kernel.shared: Pointer(shared float32), float32, [1152]), storage_scope = shared;
   attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392 {
     conv2d_nchw_1: Buffer(conv2d_nchw, float32, [1], [], scope=&quot;local&quot;, align=4)[0] = 0f32
     conv2d_nchw_1[1] = 0f32
+    conv2d_nchw_1[2] = 0f32
+    conv2d_nchw_1[3] = 0f32
     for (rc.outer.outer: int32, 0, 128) {
-      attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
-      if @tir.likely((threadIdx.x_1 &lt; 324), dtype=bool) {
-        pad_temp.shared_1: Buffer(pad_temp.shared, float32, [324], [], scope=&quot;shared&quot;)[threadIdx.x_1] = @tir.if_then_else(((((9 &lt;= floormod(threadIdx.x_1, 81)) &amp;&amp; (floormod(threadIdx.x_1, 81) &lt; 72)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[(((((rc.outer.outer*196) + (floordiv(threadIdx.x_1, 81)*49)) + (floordiv(floormod(threadIdx.x_1, 81), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-      }
-      attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
-      kernel.shared_1: Buffer(kernel.shared, float32, [576], [], scope=&quot;shared&quot;)[threadIdx.x_2] = kernel[((((blockIdx.x*73728) + (floordiv(threadIdx.x_2, 36)*4608)) + (rc.outer.outer*36)) + floormod(threadIdx.x_2, 36))]
-      attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
-      if @tir.likely((threadIdx.x_2 &lt; 184), dtype=bool) {
-        kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*73728) + (floordiv((threadIdx.x_2 + 392), 36)*4608)) + (rc.outer.outer*36)) + (floordiv(floormod((threadIdx.x_2 + 32), 36), 9)*9)) + (floordiv(floormod((threadIdx.x_2 + 5), 9), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-      }
-      for (rc.outer.inner: int32, 0, 2) {
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18))]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7))]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 288)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 9)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 297)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 1)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 289)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 10)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 298)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 2)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 290)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 11)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 299)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 3)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 291)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 12)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 300)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 4)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 292)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 13)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 301)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 5)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 293)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 14)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 302)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 6)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 294)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 15)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 303)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 7)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 295)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 16)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 304)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 8)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 296)]))
-        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 17)]))
-        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((((rc.outer.inner*162) + (floordiv(floormod(threadIdx.x, 49), 7)*9)) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[(((floordiv(threadIdx.x, 49)*36) + (rc.outer.inner*18)) + 305)]))
+      let cse_var_1: int32 = (rc.outer.outer*36)
+       {
+        attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
+        if @tir.likely((threadIdx.x_1 &lt; 324), dtype=bool) {
+          pad_temp.shared_1: Buffer(pad_temp.shared, float32, [324], [], scope=&quot;shared&quot;)[threadIdx.x_1] = @tir.if_then_else(((((9 &lt;= floormod(threadIdx.x_1, 81)) &amp;&amp; (floormod(threadIdx.x_1, 81) &lt; 72)) &amp;&amp; (1 &lt;= floormod(threadIdx.x_1, 9))) &amp;&amp; (floormod(threadIdx.x_1, 9) &lt; 8)), data[(((((rc.outer.outer*196) + (floordiv(threadIdx.x_1, 81)*49)) + (floordiv(floormod(threadIdx.x_1, 81), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
+        }
+        attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
+        kernel.shared_1: Buffer(kernel.shared, float32, [1152], [], scope=&quot;shared&quot;)[threadIdx.x_2] = kernel[((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 36)*4608)) + cse_var_1) + floormod(threadIdx.x_2, 36))]
+        attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
+        kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[((((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 392), 36)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 32), 36), 9)*9)) + (floordiv(floormod((threadIdx.x_2 + 5), 9), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
+        attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 392;
+        if @tir.likely((threadIdx.x_2 &lt; 368), dtype=bool) {
+          kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[(((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 784), 36)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 28), 36), 9)*9)) + floormod((threadIdx.x_2 + 1), 9))]
+        }
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[(floordiv(threadIdx.x, 49)*36)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 288)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 576)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7))]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 864)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 1)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 289)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 577)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 1)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 865)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 2)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 290)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 578)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 2)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 866)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 3)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 291)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 579)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 867)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 4)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 292)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 580)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 10)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 868)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 5)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 293)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 581)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 11)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 869)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 6)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 294)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 582)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 870)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 7)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 295)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 583)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 19)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 871)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 8)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 296)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 584)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 20)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 872)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 9)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 297)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 585)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 873)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 10)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 298)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 586)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 82)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 874)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 11)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 299)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 587)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 83)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 875)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 12)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 300)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 588)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 876)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 13)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 301)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 589)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 91)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 877)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 14)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 302)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 590)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 92)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 878)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 15)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 303)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 591)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 879)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 16)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 304)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 592)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 100)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 880)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 17)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 305)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 593)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 101)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 881)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 18)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 306)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 594)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 882)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 19)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 307)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 595)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 163)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 883)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 20)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 308)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 596)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 164)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 884)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 21)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 309)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 597)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 885)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 22)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 310)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 598)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 172)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 886)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 23)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 311)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 599)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 173)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 887)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 24)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 312)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 600)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 888)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 25)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 313)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 601)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 181)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 889)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 26)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 314)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 602)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 182)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 890)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 27)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 315)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 603)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 891)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 28)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 316)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 604)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 244)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 892)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 29)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 317)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 605)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 245)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 893)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 30)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 318)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 606)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 894)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 31)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 319)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 607)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 253)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 895)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 32)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 320)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 608)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 254)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 896)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 33)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 321)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 609)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 897)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 34)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 322)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 610)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 262)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 898)]))
+        conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 35)]))
+        conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 323)]))
+        conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 611)]))
+        conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((floordiv(floormod(threadIdx.x, 49), 7)*9) + floormod(threadIdx.x, 7)) + 263)]*kernel.shared_1[((floordiv(threadIdx.x, 49)*36) + 899)]))
       }
     }
-    compute[((blockIdx.x*784) + threadIdx.x)] = max((conv2d_nchw_1[0] + bias[((blockIdx.x*16) + floordiv(threadIdx.x, 49))]), 0f32)
-    compute[(((blockIdx.x*784) + threadIdx.x) + 392)] = max((conv2d_nchw_1[1] + bias[(((blockIdx.x*16) + floordiv(threadIdx.x, 49)) + 8)]), 0f32)
+    compute[((blockIdx.x*1568) + threadIdx.x)] = max((conv2d_nchw_1[0] + bias[((blockIdx.x*32) + floordiv(threadIdx.x, 49))]), 0f32)
+    compute[(((blockIdx.x*1568) + threadIdx.x) + 392)] = max((conv2d_nchw_1[1] + bias[(((blockIdx.x*32) + floordiv(threadIdx.x, 49)) + 8)]), 0f32)
+    compute[(((blockIdx.x*1568) + threadIdx.x) + 784)] = max((conv2d_nchw_1[2] + bias[(((blockIdx.x*32) + floordiv(threadIdx.x, 49)) + 16)]), 0f32)
+    compute[(((blockIdx.x*1568) + threadIdx.x) + 1176)] = max((conv2d_nchw_1[3] + bias[(((blockIdx.x*32) + floordiv(threadIdx.x, 49)) + 24)]), 0f32)
   }
 }
 </pre></div>
@@ -585,7 +700,7 @@ cooperative fetching, unrolling and operator fusion.</p>
 <span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.313 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 0.328 ms
 </pre></div>
 </div>
 </div>
@@ -617,7 +732,7 @@ conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nch
 conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=1)
 conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=1)
 conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=8)
-conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=2)
+conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=4)
 conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
 conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
 conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
@@ -628,17 +743,17 @@ conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_
 conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
 conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
 conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=2)
-conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
-conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
-conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
-conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
+conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=3)
+conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
 s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2d_nc [...]
 compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
 compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
 compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
 compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
 compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=8)
-compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=2)
+compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=4)
 compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
 compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
 compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
@@ -670,7 +785,7 @@ pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fus
 s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
 pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=392)
 s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(&quot;threadIdx.x&quot;))
-s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 64)
+s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 1024)
 s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, &quot;unroll_explicit&quot;, True)
 
 CUDA source code:
@@ -689,62 +804,173 @@ CUDA source code:
   #define uint64_t unsigned long long
 #endif
 extern &quot;C&quot; __global__ void __launch_bounds__(392) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
-  float conv2d_nchw[2];
+  float conv2d_nchw[4];
   __shared__ float pad_temp_shared[324];
-  __shared__ float kernel_shared[576];
+  __shared__ float kernel_shared[1152];
   conv2d_nchw[0] = 0.000000e+00f;
   conv2d_nchw[1] = 0.000000e+00f;
+  conv2d_nchw[2] = 0.000000e+00f;
+  conv2d_nchw[3] = 0.000000e+00f;
   for (int rc_outer_outer = 0; rc_outer_outer &lt; 128; ++rc_outer_outer) {
     __syncthreads();
     if (((int)threadIdx.x) &lt; 324) {
       pad_temp_shared[((int)threadIdx.x)] = (((((9 &lt;= (((int)threadIdx.x) % 81)) &amp;&amp; ((((int)threadIdx.x) % 81) &lt; 72)) &amp;&amp; (1 &lt;= (((int)threadIdx.x) % 9))) &amp;&amp; ((((int)threadIdx.x) % 9) &lt; 8)) ? data[(((((rc_outer_outer * 196) + ((((int)threadIdx.x) / 81) * 49)) + (((((int)threadIdx.x) % 81) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
     }
-    kernel_shared[((int)threadIdx.x)] = kernel[((((((int)blockIdx.x) * 73728) + ((((int)threadIdx.x) / 36) * 4608)) + (rc_outer_outer * 36)) + (((int)threadIdx.x) % 36))];
-    if (((int)threadIdx.x) &lt; 184) {
-      kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 73728) + (((((int)threadIdx.x) + 392) / 36) * 4608)) + (rc_outer_outer * 36)) + ((((((int)threadIdx.x) + 32) % 36) / 9) * 9)) + ((((((int)threadIdx.x) + 5) % 9) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+    kernel_shared[((int)threadIdx.x)] = kernel[((((((int)blockIdx.x) * 147456) + ((((int)threadIdx.x) / 36) * 4608)) + (rc_outer_outer * 36)) + (((int)threadIdx.x) % 36))];
+    kernel_shared[(((int)threadIdx.x) + 392)] = kernel[((((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 392) / 36) * 4608)) + (rc_outer_outer * 36)) + ((((((int)threadIdx.x) + 32) % 36) / 9) * 9)) + ((((((int)threadIdx.x) + 5) % 9) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
+    if (((int)threadIdx.x) &lt; 368) {
+      kernel_shared[(((int)threadIdx.x) + 784)] = kernel[(((((((int)blockIdx.x) * 147456) + (((((int)threadIdx.x) + 784) / 36) * 4608)) + (rc_outer_outer * 36)) + ((((((int)threadIdx.x) + 28) % 36) / 9) * 9)) + ((((int)threadIdx.x) + 1) % 9))];
     }
     __syncthreads();
-    for (int rc_outer_inner = 0; rc_outer_inner &lt; 2; ++rc_outer_inner) {
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18))]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7))] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 288)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 9)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 297)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 1)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 289)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 10)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 298)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 2)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 290)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 11)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 299)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 3)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 291)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 12)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 300)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 4)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 292)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 13)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 301)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 5)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 293)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 14)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 302)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 6)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 294)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 15)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 303)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 7)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 295)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 16)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 304)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 8)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 296)]));
-      conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 17)]));
-      conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((rc_outer_inner * 162) + (((((int)threadIdx.x) % 49) / 7) * 9)) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[((((((int)threadIdx.x) / 49) * 36) + (rc_outer_inner * 18)) + 305)]));
-    }
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[((((int)threadIdx.x) / 49) * 36)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 288)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 576)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7))] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 864)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 1)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 289)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 577)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 1)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 865)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 2)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 290)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 578)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 2)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 866)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 3)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 291)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 579)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 867)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 4)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 292)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 580)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 10)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 868)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 5)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 293)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 581)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 11)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 869)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 6)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 294)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 582)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 870)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 7)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 295)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 583)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 19)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 871)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 8)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 296)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 584)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 20)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 872)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 9)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 297)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 585)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 873)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 10)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 298)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 586)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 82)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 874)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 11)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 299)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 587)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 83)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 875)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 12)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 300)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 588)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 876)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 13)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 301)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 589)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 91)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 877)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 14)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 302)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 590)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 92)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 878)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 15)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 303)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 591)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 879)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 16)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 304)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 592)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 100)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 880)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 17)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 305)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 593)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 101)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 881)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 18)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 306)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 594)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 882)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 19)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 307)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 595)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 163)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 883)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 20)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 308)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 596)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 164)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 884)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 21)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 309)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 597)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 885)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 22)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 310)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 598)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 172)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 886)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 23)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 311)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 599)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 173)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 887)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 24)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 312)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 600)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 888)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 25)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 313)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 601)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 181)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 889)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 26)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 314)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 602)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 182)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 890)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 27)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 315)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 603)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 891)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 28)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 316)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 604)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 244)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 892)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 29)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 317)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 605)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 245)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 893)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 30)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 318)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 606)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 894)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 31)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 319)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 607)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 253)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 895)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 32)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 320)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 608)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 254)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 896)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 33)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 321)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 609)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 897)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 34)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 322)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 610)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 262)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 898)]));
+    conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 35)]));
+    conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 323)]));
+    conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 611)]));
+    conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((((((int)threadIdx.x) % 49) / 7) * 9) + (((int)threadIdx.x) % 7)) + 263)] * kernel_shared[(((((int)threadIdx.x) / 49) * 36) + 899)]));
   }
-  compute[((((int)blockIdx.x) * 784) + ((int)threadIdx.x))] = max((conv2d_nchw[0] + bias[((((int)blockIdx.x) * 16) + (((int)threadIdx.x) / 49))]), 0.000000e+00f);
-  compute[(((((int)blockIdx.x) * 784) + ((int)threadIdx.x)) + 392)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) * 16) + (((int)threadIdx.x) / 49)) + 8)]), 0.000000e+00f);
+  compute[((((int)blockIdx.x) * 1568) + ((int)threadIdx.x))] = max((conv2d_nchw[0] + bias[((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49))]), 0.000000e+00f);
+  compute[(((((int)blockIdx.x) * 1568) + ((int)threadIdx.x)) + 392)] = max((conv2d_nchw[1] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49)) + 8)]), 0.000000e+00f);
+  compute[(((((int)blockIdx.x) * 1568) + ((int)threadIdx.x)) + 784)] = max((conv2d_nchw[2] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49)) + 16)]), 0.000000e+00f);
+  compute[(((((int)blockIdx.x) * 1568) + ((int)threadIdx.x)) + 1176)] = max((conv2d_nchw[3] + bias[(((((int)blockIdx.x) * 32) + (((int)threadIdx.x) / 49)) + 24)]), 0.000000e+00f);
 }
 </pre></div>
 </div>
@@ -780,7 +1006,7 @@ In the example below we resume the status and do more 5 trials.</p>
 Get devices for measurement successfully!
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 3 minutes  10.941 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 3 minutes  16.712 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-conv2d-layer-cuda-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_conv2d_layer_cuda.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
index 16d41b10b..ab5a6e9ad 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html
@@ -906,7 +906,7 @@ so we can read the log file and load the best schedules.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-   9.9161       9.9280       9.9583       9.8619       0.0403
+   9.9457       9.9355      10.0108       9.8908       0.0496
 </pre></div>
 </div>
 </div>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
index f8fd47474..973fdef26 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_network_x86.html
@@ -925,7 +925,7 @@ so we can read the log file and load the best schedules.</p>
 Evaluate inference time cost...
 Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
-  742.2515     742.3502     742.4684     741.9358      0.2284
+  764.6261     764.9087     765.3120     763.6577      0.7043
 </pre></div>
 </div>
 </div>
@@ -947,7 +947,7 @@ to learn how to use the RPC Tracker and RPC Server.
 To use the RPC Tracker in auto-scheduler, replace the runner in <code class="code docutils literal notranslate"><span class="pre">TuningOptions</span></code>
 with <a class="reference internal" href="../../reference/api/python/auto_scheduler.html#tvm.auto_scheduler.RPCRunner" title="tvm.auto_scheduler.RPCRunner"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.RPCRunner</span></code></a>.</p></li>
 </ol>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  20.822 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  23.975 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autoscheduler-tune-network-x86-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_network_x86.py</span></code></a></p>
diff --git a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
index 99b71eee2..76d289b88 100644
--- a/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
+++ b/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html
@@ -625,30 +625,78 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
              placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
              compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
   buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
-  preflattened_buffer_map = {placeholder_6: placeholder_15: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_7: placeholder_16: Buffer(placeholder_12, int32, [4916], []), placeholder_5: placeholder_17: Buffer(placeholder_10, float32, [128, 256], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_9: placeholder_18: Buffer(placeholder_14, float32, [128, 512], []), placeholder_8: placeholder_19: Buffer(placeholder_13, int32, [33], [])} {
-  for (i0.outer.i1.outer.fused: int32, 0, 64) &quot;parallel&quot; {
-    allocate(compute_4: Pointer(global float32), float32, [1024]), storage_scope = global {
-      for (i.outer.inner: int32, 0, 8) {
+  preflattened_buffer_map = {placeholder_6: placeholder_15: Buffer(placeholder_11, float32, [4916, 16, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), placeholder_8: placeholder_17: Buffer(placeholder_13, int32, [33], []), placeholder_7: placeholder_18: Buffer(placeholder_12, int32, [4916], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
+  for (i0.outer.i1.outer.fused: int32, 0, 16) &quot;parallel&quot; {
+    allocate(compute_4: Pointer(global float32), float32, [4096]), storage_scope = global {
+      for (i.outer.inner: int32, 0, 4) {
         for (nb_j.inner: int32, 0, 2) {
-          for (i.inner.init: int32, 0, 4) {
-            for (j.init: int32, 0, 16) {
-              compute_5: Buffer(compute_4, float32, [1024], [])[((((i.outer.inner*128) + (i.inner.init*32)) + (nb_j.inner*16)) + j.init)] = 0f32
+          for (i.inner.init: int32, 0, 32) {
+            let cse_var_1: int32 = (((i.outer.inner*1024) + (i.inner.init*32)) + (nb_j.inner*16))
+             {
+              compute_5: Buffer(compute_4, float32, [4096], [])[cse_var_1] = 0f32
+              compute_5[(cse_var_1 + 1)] = 0f32
+              compute_5[(cse_var_1 + 2)] = 0f32
+              compute_5[(cse_var_1 + 3)] = 0f32
+              compute_5[(cse_var_1 + 4)] = 0f32
+              compute_5[(cse_var_1 + 5)] = 0f32
+              compute_5[(cse_var_1 + 6)] = 0f32
+              compute_5[(cse_var_1 + 7)] = 0f32
+              compute_5[(cse_var_1 + 8)] = 0f32
+              compute_5[(cse_var_1 + 9)] = 0f32
+              compute_5[(cse_var_1 + 10)] = 0f32
+              compute_5[(cse_var_1 + 11)] = 0f32
+              compute_5[(cse_var_1 + 12)] = 0f32
+              compute_5[(cse_var_1 + 13)] = 0f32
+              compute_5[(cse_var_1 + 14)] = 0f32
+              compute_5[(cse_var_1 + 15)] = 0f32
             }
           }
-          for (elem_idx: int32, 0, let cse_var_1: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) in (placeholder_3[(cse_var_1 + 1)] - placeholder_3[cse_var_1])) {
-            for (i.inner: int32, 0, 4) {
-              for (j: int32, 0, 16) {
-                let cse_var_3: int32 = ((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)
-                let cse_var_2: int32 = ((((i.outer.inner*128) + (i.inner*32)) + (nb_j.inner*16)) + j)
-                compute_5[cse_var_2] = (compute_5[cse_var_2] + (placeholder_1[(((placeholder_3[cse_var_3]*16) + (elem_idx*16)) + j)]*max(placeholder[((((floordiv(i0.outer.i1.outer.fused, 16)*8192) + (i.outer.inner*1024)) + (i.inner*256)) + placeholder_2[(placeholder_3[cse_var_3] + elem_idx)])], 0f32)))
+          for (elem_idx: int32, 0, let cse_var_2: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
+            for (i.inner: int32, 0, 32) {
+              let cse_var_21: int32 = (elem_idx*16)
+              let cse_var_20: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
+              let cse_var_19: int32 = ((i.outer.inner*8192) + (i.inner*256))
+              let cse_var_18: int32 = (((i.outer.inner*1024) + (i.inner*32)) + (nb_j.inner*16))
+              let cse_var_17: int32 = (cse_var_18 + 9)
+              let cse_var_16: int32 = (cse_var_18 + 8)
+              let cse_var_15: int32 = (cse_var_18 + 7)
+              let cse_var_14: int32 = (cse_var_18 + 6)
+              let cse_var_13: int32 = (cse_var_18 + 5)
+              let cse_var_12: int32 = (cse_var_18 + 4)
+              let cse_var_11: int32 = (cse_var_18 + 3)
+              let cse_var_10: int32 = (cse_var_18 + 2)
+              let cse_var_9: int32 = (cse_var_18 + 15)
+              let cse_var_8: int32 = (cse_var_18 + 14)
+              let cse_var_7: int32 = (cse_var_18 + 13)
+              let cse_var_6: int32 = (cse_var_18 + 12)
+              let cse_var_5: int32 = (cse_var_18 + 11)
+              let cse_var_4: int32 = (cse_var_18 + 10)
+              let cse_var_3: int32 = (cse_var_18 + 1)
+               {
+                compute_5[cse_var_18] = (compute_5[cse_var_18] + (placeholder_1[((placeholder_3[cse_var_20]*16) + cse_var_21)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_3] = (compute_5[cse_var_3] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 1)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 2)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 3)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 4)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 5)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 6)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 7)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 8)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 9)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 10)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 11)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 12)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 13)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 14)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 15)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
               }
             }
           }
         }
       }
-      for (i0.inner: int32, 0, 32) {
-        let cse_var_4: int32 = (((floordiv(i0.outer.i1.outer.fused, 16)*16384) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32))
-        compute[ramp(cse_var_4, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_4, 1, 32)]), broadcast(0f32, 32))
+      for (i0.inner: int32, 0, 128) {
+        let cse_var_22: int32 = ((i0.inner*512) + (i0.outer.i1.outer.fused*32))
+        compute[ramp(cse_var_22, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_22, 1, 32)]), broadcast(0f32, 32))
       }
     }
   }
@@ -686,7 +734,7 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
 <span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.452 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 1.744 ms
 </pre></div>
 </div>
 <div class="admonition note">
diff --git a/docs/how_to/tune_with_autotvm/sg_execution_times.html b/docs/how_to/tune_with_autotvm/sg_execution_times.html
index 78f3ad73c..1084be649 100644
--- a/docs/how_to/tune_with_autotvm/sg_execution_times.html
+++ b/docs/how_to/tune_with_autotvm/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-tune-with-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:44.772</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
+<p><strong>00:46.154</strong> total execution time for <strong>how_to_tune_with_autotvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -336,11 +336,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></td>
-<td><p>00:44.741</p></td>
+<td><p>00:46.113</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_relay_x86.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a Convolutional Network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></td>
-<td><p>00:00.016</p></td>
+<td><p>00:00.026</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_relay_cuda.html#sphx-glr-how-to-tune-with-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a Convolutional Network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></td>
diff --git a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
index ba81c1e90..55dcd48ae 100644
--- a/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
+++ b/docs/how_to/tune_with_autotvm/tune_conv2d_cuda.html
@@ -1436,8 +1436,8 @@ No: 8   GFLOPS: 0.00/0.00       result: Traceback (most recent call last):
 TimeoutError
 
         [(&#39;tile_f&#39;, [-1, 2, 1, 64]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 1, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4909501
-No: 9   GFLOPS: 187.05/187.05   result: MeasureResult(costs=(0.001237657193548387,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.0630202293395996, timestamp=1659036206.659014)        [(&#39;tile_f&#39;, [-1, 1, 4, 8]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 2, 2]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,5072689
-No: 10  GFLOPS: 0.00/187.05     result: Traceback (most recent call last):
+No: 9   GFLOPS: 80.74/80.74     result: MeasureResult(costs=(0.002867256942857143,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8604607582092285, timestamp=1659038348.8080869)       [(&#39;tile_f&#39;, [-1, 1, 4, 8]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 2, 2]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,5072689
+No: 10  GFLOPS: 0.00/80.74      result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1560,8 +1560,8 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 4, 4, 8]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 64, 2]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,5092711
-No: 11  GFLOPS: 261.45/261.45   result: MeasureResult(costs=(0.0008854458826815641,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.7113349437713623, timestamp=1659036207.5683029)      [(&#39;tile_f&#39;, [-1, 8, 2, 1]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 2, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4264713
-No: 12  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+No: 11  GFLOPS: 260.36/260.36   result: MeasureResult(costs=(0.0008891593646408842,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.494821548461914, timestamp=1659038349.724512)        [(&#39;tile_f&#39;, [-1, 8, 2, 1]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 2, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4264713
+No: 12  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1684,7 +1684,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 128, 1, 2]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 256]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,183542
-No: 13  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+No: 13  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1807,7 +1807,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 4, 8, 8]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 64]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2482196
-No: 14  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+No: 14  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -1930,9 +1930,9 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 64, 1, 4]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 4, 2]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,10306226
-No: 15  GFLOPS: 5.31/261.45     result: MeasureResult(costs=(0.04356790875,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8453178405761719, timestamp=1659036212.0925148)      [(&#39;tile_f&#39;, [-1, 2, 2, 8]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 8]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,5330964
-No: 16  GFLOPS: 3.35/261.45     result: MeasureResult(costs=(0.06905426975000001,), error_no=MeasureErrorNo.NO_ERROR, all_cost=4.511449813842773, timestamp=1659036213.3184597) [(&#39;tile_f&#39;, [-1, 8, 4, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2140058
-No: 17  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+No: 15  GFLOPS: 5.29/260.36     result: MeasureResult(costs=(0.04375233075,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.8625049591064453, timestamp=1659038354.3179812)      [(&#39;tile_f&#39;, [-1, 2, 2, 8]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 8]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,5330964
+No: 16  GFLOPS: 3.34/260.36     result: MeasureResult(costs=(0.06939617725,), error_no=MeasureErrorNo.NO_ERROR, all_cost=4.593501091003418, timestamp=1659038355.5618432)       [(&#39;tile_f&#39;, [-1, 8, 4, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 1, 1, 7]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2140058
+No: 17  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 142, in build
     res = future.result()
   File &quot;/usr/lib/python3.7/concurrent/futures/_base.py&quot;, line 435, in result
@@ -1950,8 +1950,8 @@ No: 17  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
 TimeoutError
 
         [(&#39;tile_f&#39;, [-1, 2, 2, 1]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 16]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,10195251
-No: 18  GFLOPS: 28.17/261.45    result: MeasureResult(costs=(0.008217329785714286,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.2581093311309814, timestamp=1659036224.3556004)       [(&#39;tile_f&#39;, [-1, 4, 8, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6068603
-No: 19  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+No: 18  GFLOPS: 27.95/260.36    result: MeasureResult(costs=(0.008283954928571428,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.2748453617095947, timestamp=1659038366.575956)        [(&#39;tile_f&#39;, [-1, 4, 8, 4]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 4]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6068603
+No: 19  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2074,7 +2074,7 @@ Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 871, in verify_pass
     raise InstantiationError(&quot;Skipped because of invalid gpu kernel&quot;)
 tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [(&#39;tile_f&#39;, [-1, 16, 4, 8]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 128]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 3]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,6956993
-No: 20  GFLOPS: 0.00/261.45     result: Traceback (most recent call last):
+No: 20  GFLOPS: 0.00/260.36     result: Traceback (most recent call last):
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 588, in __call__
     func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
   File &quot;/workspace/python/tvm/autotvm/measure/measure_methods.py&quot;, line 540, in _build_func_common
@@ -2237,7 +2237,7 @@ and measure running time.</p>
 Best config:
 [(&#39;tile_f&#39;, [-1, 8, 2, 1]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 2, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 0)],None,4264713
 Finish loading 20 records
-Time cost of this operator: 0.001307
+Time cost of this operator: 0.001256
 </pre></div>
 </div>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-tune-with-autotvm-tune-conv2d-cuda-py">
diff --git a/docs/how_to/work_with_microtvm/index.html b/docs/how_to/work_with_microtvm/index.html
index e6bff75ce..7d98c91a8 100644
--- a/docs/how_to/work_with_microtvm/index.html
+++ b/docs/how_to/work_with_microtvm/index.html
@@ -48,7 +48,7 @@
     <script type="text/javascript" src="../../_static/js/tlcpack_theme.js"></script>
     <link rel="index" title="Index" href="../../genindex.html" />
     <link rel="search" title="Search" href="../../search.html" />
-    <link rel="next" title="Autotuning with microTVM" href="micro_autotune.html" />
+    <link rel="next" title="microTVM Host-Driven AoT" href="micro_aot.html" />
     <link rel="prev" title="Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule" href="../tune_with_autoscheduler/tune_sparse_x86.html" /> 
 </head>
 
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="current reference internal" href="#">Work With microTVM</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_ethosu.html">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_reference_vm.html">microTVM Reference Virtual Machines</a></li>
@@ -356,7 +357,10 @@
 <p>microTVM enables inference on bare-metal platforms, for example, those without
 a traditional Operating System such as Linux, OS X, or Windows. These how-tos
 demonstrate how to tune and deploy models with microTVM.</p>
-<div class="sphx-glr-thumbnails"><div class="sphx-glr-thumbcontainer" tooltip="This tutorial explains how to autotune a model using the C runtime."><img alt="Autotuning with microTVM" src="../../_images/sphx_glr_micro_autotune_thumb.png" />
+<div class="sphx-glr-thumbnails"><div class="sphx-glr-thumbcontainer" tooltip="This tutorial is showcasing microTVM host-driven AoT compilation with a TFLite model. AoTExecut..."><img alt="microTVM Host-Driven AoT" src="../../_images/sphx_glr_micro_aot_thumb.png" />
+<p><a class="reference internal" href="micro_aot.html#sphx-glr-how-to-work-with-microtvm-micro-aot-py"><span class="std std-ref">microTVM Host-Driven AoT</span></a></p>
+  <div class="sphx-glr-thumbnail-title">microTVM Host-Driven AoT</div>
+</div><div class="sphx-glr-thumbcontainer" tooltip="This tutorial explains how to autotune a model using the C runtime."><img alt="Autotuning with microTVM" src="../../_images/sphx_glr_micro_autotune_thumb.png" />
 <p><a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a></p>
   <div class="sphx-glr-thumbnail-title">Autotuning with microTVM</div>
 </div><div class="sphx-glr-thumbcontainer" tooltip="This section contains an example of how to use TVM to run a model on an Arm(R) Cortex(R)-M55 CP..."><img alt="Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN" src="../../_images/sphx_glr_micro_ethosu_thumb.png" />
@@ -389,7 +393,7 @@ demonstrate how to tune and deploy models with microTVM.</p>
 
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="micro_autotune.html" class="btn btn-neutral float-right" title="Autotuning with microTVM" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
+        <a href="micro_aot.html" class="btn btn-neutral float-right" title="microTVM Host-Driven AoT" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
       
       
         <a href="../tune_with_autoscheduler/tune_sparse_x86.html" class="btn btn-neutral float-left" title="Auto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
diff --git a/docs/how_to/work_with_microtvm/micro_aot.html b/docs/how_to/work_with_microtvm/micro_aot.html
new file mode 100644
index 000000000..fecf932fc
--- /dev/null
+++ b/docs/how_to/work_with_microtvm/micro_aot.html
@@ -0,0 +1,605 @@
+
+
+
+
+
+
+<!DOCTYPE html>
+<html class="writer-html5" lang="en" >
+<head>
+  <meta charset="utf-8">
+  
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  
+  <title>microTVM Host-Driven AoT &mdash; tvm 0.10.dev0 documentation</title>
+  
+
+  
+  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
+  <link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/sg_gallery.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/sg_gallery-binder.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/sg_gallery-dataframe.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/sg_gallery-rendered-html.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/css/tlcpack_theme.css" type="text/css" />
+
+  
+  
+    <link rel="shortcut icon" href="../../_static/tvm-logo-square.png"/>
+  
+
+  
+  
+  
+  
+    
+      <script type="text/javascript" id="documentation_options" data-url_root="../../" src="../../_static/documentation_options.js"></script>
+        <script data-url_root="../../" id="documentation_options" src="../../_static/documentation_options.js"></script>
+        <script src="../../_static/jquery.js"></script>
+        <script src="../../_static/underscore.js"></script>
+        <script src="../../_static/doctools.js"></script>
+    
+    <script type="text/javascript" src="../../_static/js/theme.js"></script>
+
+    
+    <script type="text/javascript" src="../../_static/js/tlcpack_theme.js"></script>
+    <link rel="index" title="Index" href="../../genindex.html" />
+    <link rel="search" title="Search" href="../../search.html" />
+    <link rel="next" title="Autotuning with microTVM" href="micro_autotune.html" />
+    <link rel="prev" title="Work With microTVM" href="index.html" /> 
+</head>
+
+<body class="wy-body-for-nav">
+
+   
+  <div class="wy-grid-for-nav">
+    
+    
+<header class="header">
+    <div class="innercontainer">
+      <div class="headerInner d-flex justify-content-between align-items-center">
+          <div class="headerLogo">
+               <a href="https://tvm.apache.org/"><img src=https://tvm.apache.org/assets/images/logo.svg alt="logo"></a>
+          </div>
+
+          <div id="headMenu" class="headerNav">
+            <button type="button" id="closeHeadMenu" class="navCloseBtn"><img src="../../_static/img/close-icon.svg" alt="Close"></button>
+             <ul class="nav">
+                <li class="nav-item">
+                   <a class="nav-link" href=https://tvm.apache.org/community>Community</a>
+                </li>
+                <li class="nav-item">
+                   <a class="nav-link" href=https://tvm.apache.org/download>Download</a>
+                </li>
+                <li class="nav-item">
+                   <a class="nav-link" href=https://tvm.apache.org/vta>VTA</a>
+                </li>
+                <li class="nav-item">
+                   <a class="nav-link" href=https://tvm.apache.org/blog>Blog</a>
+                </li>
+                <li class="nav-item">
+                   <a class="nav-link" href=https://tvm.apache.org/docs>Docs</a>
+                </li>
+                <li class="nav-item">
+                   <a class="nav-link" href=https://tvmconf.org>Conference</a>
+                </li>
+                <li class="nav-item">
+                   <a class="nav-link" href=https://github.com/apache/tvm/>Github</a>
+                </li>
+             </ul>
+               <div class="responsivetlcdropdown">
+                 <button type="button" class="btn-link">
+                   ASF
+                 </button>
+                 <ul>
+                     <li>
+                       <a href=https://apache.org/>Apache Homepage</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/licenses/>License</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/foundation/sponsorship.html>Sponsorship</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/security/>Security</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/foundation/thanks.html>Thanks</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/events/current-event>Events</a>
+                     </li>
+                 </ul>
+               </div>
+          </div>
+            <div class="responsiveMenuIcon">
+              <button type="button" id="menuBtn" class="btn-menu"><img src="../../_static/img/menu-icon.svg" alt="Menu Icon"></button>
+            </div>
+
+            <div class="tlcDropdown">
+              <div class="dropdown">
+                <button type="button" class="btn-link dropdown-toggle" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
+                  ASF
+                </button>
+                <div class="dropdown-menu dropdown-menu-right">
+                  <ul>
+                     <li>
+                       <a href=https://apache.org/>Apache Homepage</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/licenses/>License</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/foundation/sponsorship.html>Sponsorship</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/security/>Security</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/foundation/thanks.html>Thanks</a>
+                     </li>
+                     <li>
+                       <a href=https://www.apache.org/events/current-event>Events</a>
+                     </li>
+                  </ul>
+                </div>
+              </div>
+          </div>
+       </div>
+    </div>
+ </header>
+ 
+    <nav data-toggle="wy-nav-shift" class="wy-nav-side fixed">
+      <div class="wy-side-scroll">
+        <div class="wy-side-nav-search" >
+          
+
+          
+            <a href="../../index.html">
+          
+
+          
+            
+            <img src="../../_static/tvm-logo-small.png" class="logo" alt="Logo"/>
+          
+          </a>
+
+          
+            
+            
+              <input type="checkbox" class="version-toggle-box" hidden id="version-toggle">
+              <label for="version-toggle" class="version-toggle-label">
+                  <div tabindex="0" class="version version-selector version-selector-show">
+                    0.10.dev0 <span class="chevron versions-hidden"><svg fill="none" height="24" viewBox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg"><path d="m8 4 8 8-8 8" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width="2"/></svg></span><span class="chevron versions-shown"><svg fill="none" height="24" viewBox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg"><path d="m4 8 8 8 8-8" stroke="#000" stroke-linecap="round" stroke-linejoin="round [...]
+                  </div>
+                </label>
+                <div class="version-details wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
+                  <p class="caption" role="heading"><span class="caption-text">Versions</span></p>
+                  <ol style="text-align: left">
+                    
+                    
+                    
+                    
+                      <li><div class="version"><a style="font-size: 0.8em; padding: 4px" href="/">0.10.dev0 (main)</a></div></li>
+                    
+                    
+                    
+                    
+                      <li><div class="version"><a style="font-size: 0.8em; padding: 4px" href="v0.8.0/">v0.8.0</a></div></li>
+                    
+                    
+                    
+                    
+                      <li><div class="version"><a style="font-size: 0.8em; padding: 4px" href="v0.9.0/">v0.9.0</a></div></li>
+                    
+                  </ol>
+                </div>
+            
+          
+
+          
+<div role="search">
+  <form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
+    <input type="text" name="q" placeholder="Search docs" />
+    <input type="hidden" name="check_keywords" value="yes" />
+    <input type="hidden" name="area" value="default" />
+  </form>
+</div>
+
+          
+        </div>
+
+        
+        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
+          
+            
+            
+              
+            
+            
+              <p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
+<ul>
+<li class="toctree-l1"><a class="reference internal" href="../../install/index.html">Installing TVM</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../contribute/index.html">Contributor Guide</a></li>
+</ul>
+<p class="caption" role="heading"><span class="caption-text">User Guide</span></p>
+<ul class="current">
+<li class="toctree-l1"><a class="reference internal" href="../../tutorial/index.html">User Tutorial</a></li>
+<li class="toctree-l1 current"><a class="reference internal" href="../index.html">How To Guides</a><ul class="current">
+<li class="toctree-l2"><a class="reference internal" href="../compile_models/index.html">Compile Deep Learning Models</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../deploy/index.html">Deploy Models and Integrate TVM</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../work_with_relay/index.html">Work With Relay</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../work_with_schedules/index.html">Work With Tensor Expression and Schedules</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../optimize_operators/index.html">Optimize Tensor Operators</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
+<li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3 current"><a class="current reference internal" href="#">microTVM Host-Driven AoT</a><ul>
+<li class="toctree-l4"><a class="reference internal" href="#import-a-tflite-model">Import a TFLite model</a></li>
+<li class="toctree-l4"><a class="reference internal" href="#defining-the-target">Defining the target</a></li>
+<li class="toctree-l4"><a class="reference internal" href="#compile-the-model">Compile the model</a></li>
+<li class="toctree-l4"><a class="reference internal" href="#create-a-microtvm-project">Create a microTVM project</a></li>
+<li class="toctree-l4"><a class="reference internal" href="#build-flash-and-execute-the-model">Build, flash and execute the model</a></li>
+</ul>
+</li>
+<li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
+<li class="toctree-l3"><a class="reference internal" href="micro_ethosu.html">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a></li>
+<li class="toctree-l3"><a class="reference internal" href="micro_reference_vm.html">microTVM Reference Virtual Machines</a></li>
+<li class="toctree-l3"><a class="reference internal" href="micro_tflite.html">microTVM with TFLite Models</a></li>
+<li class="toctree-l3"><a class="reference internal" href="micro_train.html">Training Vision Models for microTVM on Arduino</a></li>
+<li class="toctree-l3"><a class="reference internal" href="micro_tvmc.html">Executing a Tiny Model with TVMC Micro</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../extend_tvm/index.html">Extend TVM</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../profile/index.html">Profile Models</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../../errors.html">Handle TVM Errors</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../../faq.html">Frequently Asked Questions</a></li>
+</ul>
+</li>
+</ul>
+<p class="caption" role="heading"><span class="caption-text">Developer Guide</span></p>
+<ul>
+<li class="toctree-l1"><a class="reference internal" href="../../dev/tutorial/index.html">Developer Tutorial</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../dev/how_to/how_to.html">Developer How-To Guide</a></li>
+</ul>
+<p class="caption" role="heading"><span class="caption-text">Architecture  Guide</span></p>
+<ul>
+<li class="toctree-l1"><a class="reference internal" href="../../arch/index.html">Design and Architecture</a></li>
+</ul>
+<p class="caption" role="heading"><span class="caption-text">Topic Guides</span></p>
+<ul>
+<li class="toctree-l1"><a class="reference internal" href="../../topic/microtvm/index.html">microTVM: TVM on bare-metal</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../topic/vta/index.html">VTA: Versatile Tensor Accelerator</a></li>
+</ul>
+<p class="caption" role="heading"><span class="caption-text">Reference Guide</span></p>
+<ul>
+<li class="toctree-l1"><a class="reference internal" href="../../reference/langref/index.html">Language Reference</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../reference/api/python/index.html">Python API</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../reference/api/links.html">Other APIs</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../reference/publications.html">Publications</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../genindex.html">Index</a></li>
+</ul>
+
+            
+          
+        </div>
+        
+      </div>
+    </nav>
+
+    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
+      
+      <nav class="wy-nav-top" aria-label="top navigation" data-toggle="wy-nav-top">
+        
+            <div class="togglemenu">
+
+            </div>
+            <div class="nav-content">
+              <!-- tvm -->
+              Table of Contents
+            </div>
+        
+      </nav>
+
+
+      <div class="wy-nav-content">
+        
+        <div class="rst-content">
+        
+
+          
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+<div role="navigation" aria-label="breadcrumbs navigation">
+
+  <ul class="wy-breadcrumbs">
+    
+      <li><a href="../../index.html">Docs</a> <span class="br-arrow">></span></li>
+        
+          <li><a href="../index.html">How To Guides</a> <span class="br-arrow">></span></li>
+        
+          <li><a href="index.html">Work With microTVM</a> <span class="br-arrow">></span></li>
+        
+      <li>microTVM Host-Driven AoT</li>
+    
+    
+      <li class="wy-breadcrumbs-aside">
+        
+            
+            <a href="../../_sources/how_to/work_with_microtvm/micro_aot.rst.txt" rel="nofollow"> <img src="../../_static/img/source.svg" alt="viewsource"/></a>
+          
+        
+      </li>
+    
+  </ul>
+
+  
+  <hr/>
+</div>
+          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
+           <div itemprop="articleBody">
+            
+  <div class="sphx-glr-download-link-note admonition note">
+<p class="admonition-title">Note</p>
+<p>Click <a class="reference internal" href="#sphx-glr-download-how-to-work-with-microtvm-micro-aot-py"><span class="std std-ref">here</span></a>
+to download the full example code</p>
+</div>
+<div class="sphx-glr-example-title section" id="microtvm-host-driven-aot">
+<span id="tutorial-micro-aot"></span><span id="sphx-glr-how-to-work-with-microtvm-micro-aot-py"></span><h1>microTVM Host-Driven AoT<a class="headerlink" href="#microtvm-host-driven-aot" title="Permalink to this headline">¶</a></h1>
+<p><strong>Authors</strong>:
+<a class="reference external" href="https://github.com/mehrdadh">Mehrdad Hessar</a>,
+<a class="reference external" href="https://github.com/alanmacd">Alan MacDonald</a></p>
+<p>This tutorial showcases microTVM host-driven AoT compilation with
+a TFLite model. AoTExecutor reduces the overhead of parsing the graph at runtime
+compared to GraphExecutor. Ahead-of-time compilation also allows better memory
+management. This tutorial can be executed on an x86 CPU using the C runtime (CRT)
+or on the Zephyr platform on a microcontroller/board supported by Zephyr.</p>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
+<span class="kn">import</span> <span class="nn">pathlib</span>
+<span class="kn">import</span> <span class="nn">json</span>
+<span class="kn">import</span> <span class="nn">os</span>
+
+<span class="kn">import</span> <span class="nn">tvm</span>
+<span class="kn">from</span> <span class="nn">tvm</span> <span class="kn">import</span> <span class="n">relay</span>
+<span class="kn">from</span> <span class="nn">tvm.relay.backend</span> <span class="kn">import</span> <span class="n">Executor</span><span class="p">,</span> <span class="n">Runtime</span>
+<span class="kn">from</span> <span class="nn">tvm.contrib.download</span> <span class="kn">import</span> <span class="n">download_testdata</span>
+</pre></div>
+</div>
+<div class="section" id="import-a-tflite-model">
+<h2>Import a TFLite model<a class="headerlink" href="#import-a-tflite-model" title="Permalink to this headline">¶</a></h2>
+<p>To begin, download and import a keyword spotting (KWS) TFLite model.
+This model originally comes from the <a class="reference external" href="https://github.com/mlcommons/tiny">MLPerf Tiny repository</a>.
+To test this model, we use samples from the <a class="reference external" href="https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html">KWS dataset provided by Google</a>.</p>
+<p><strong>Note:</strong> By default, this tutorial runs on an x86 CPU using CRT. To run on the Zephyr platform instead,
+export the <cite>TVM_MICRO_USE_HW</cite> environment variable.</p>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><a href="https://docs.python.org/3/library/functions.html#bool" title="builtins.bool" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">use_physical_hw</span></a> <span class="o">=</span> <span class="nb">bool</span><span class="p">(</span><a href="https://docs.python.org/3/library/os.html#os.getenv" title="os.getenv" class="sphx-glr-backref- [...]
+<a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">MODEL_URL</span></a> <span class="o">=</span> <span class="s2">&quot;https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/model/keyword_spotting_quant.tflite&quot;</span>
+<a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">MODEL_PATH</span></a> <span class="o">=</span> <span class="n">download_testdata</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><sp [...]
+<a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">SAMPLE_URL</span></a> <span class="o">=</span> <span class="s2">&quot;https://github.com/tlc-pack/web-data/raw/main/testdata/microTVM/data/keyword_spotting_int8_6.pyc.npy&quot;</span>
+<a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">SAMPLE_PATH</span></a> <span class="o">=</span> <span class="n">download_testdata</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><s [...]
+
+<a href="https://docs.python.org/3/library/stdtypes.html#bytes" title="builtins.bytes" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">tflite_model_buf</span></a> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span [...]
+<span class="k">try</span><span class="p">:</span>
+    <span class="kn">import</span> <span class="nn">tflite</span>
+
+    <span class="n">tflite_model</span> <span class="o">=</span> <span class="n">tflite</span><span class="o">.</span><span class="n">Model</span><span class="o">.</span><span class="n">GetRootAsModel</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#bytes" title="builtins.bytes" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">tflite_model_buf</span></a><span class="p">,</span> <span cl [...]
+<span class="k">except</span> <span class="ne">AttributeError</span><span class="p">:</span>
+    <span class="kn">import</span> <span class="nn">tflite.Model</span>
+
+    <span class="n">tflite_model</span> <span class="o">=</span> <span class="n">tflite</span><span class="o">.</span><span class="n">Model</span><span class="o">.</span><span class="n">Model</span><span class="o">.</span><span class="n">GetRootAsModel</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#bytes" title="builtins.bytes" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">tflite_m [...]
+
+<a href="https://docs.python.org/3/library/stdtypes.html#tuple" title="builtins.tuple" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">input_shape</span></a> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
+<a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">INPUT_NAME</span></a> <span class="o">=</span> <span class="s2">&quot;input_1&quot;</span>
+<span class="n">relay_mod</span><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">params</span></a> <span class="o">=</span> <a href="../../reference/api/python/relay/frontend.html#tvm.relay.frontend.from_tflite" title="tvm.relay.frontend.from_tflite" class="sphx-glr-backref-module-tvm-relay-frontend sphx-glr-backref [...]
+    <span class="n">tflite_model</span><span class="p">,</span> <span class="n">shape_dict</span><span class="o">=</span><span class="p">{</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">INPUT_NAME</span></a><span class="p">:</span> <a href="https://docs.python.org/3/library/stdtypes.html#tuple" title="builtins.tuple" class="sphx-glr-b [...]
+<span class="p">)</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="defining-the-target">
+<h2>Defining the target<a class="headerlink" href="#defining-the-target" title="Permalink to this headline">¶</a></h2>
+<p>Now we need to define the target, runtime, and executor. This tutorial focuses on
+the host-driven AOT executor. We use the host micro target, which runs a model
+on an x86 CPU using CRT, or with the Zephyr platform on the qemu_x86 simulator
+board. For a physical microcontroller, we look up the target model for the physical
+board (e.g. nucleo_l4r5zi) and pass it to <cite>tvm.target.target.micro</cite> to create a full
+micro target.</p>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># Use the C runtime (crt) and enable static linking by setting system-lib to True</span>
+<span class="n">RUNTIME</span> <span class="o">=</span> <span class="n">Runtime</span><span class="p">(</span><span class="s2">&quot;crt&quot;</span><span class="p">,</span> <span class="p">{</span><span class="s2">&quot;system-lib&quot;</span><span class="p">:</span> <span class="kc">True</span><span class="p">})</span>
+
+<span class="c1"># Simulate a microcontroller on the host machine. Uses the main() from `src/runtime/crt/host/main.cc &lt;https://github.com/apache/tvm/blob/main/src/runtime/crt/host/main.cc&gt;`_.</span>
+<span class="c1"># To use physical hardware, replace &quot;host&quot; with something matching your hardware.</span>
+<a href="../../reference/api/python/target.html#tvm.target.Target" title="tvm.target.Target" class="sphx-glr-backref-module-tvm-target sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">TARGET</span></a> <span class="o">=</span> <span class="n">tvm</span><span class="o">.</span><span class="n">target</span><span class="o">.</span><span class="n">target</span><span class="o">.</span><span class="n">micro</span><span class="p">(</span><span class="s2">&quot;host&quot [...]
+
+<span class="c1"># Use the AOT executor rather than graph or vm executors. Don&#39;t use unpacked API or C calling style.</span>
+<span class="n">EXECUTOR</span> <span class="o">=</span> <span class="n">Executor</span><span class="p">(</span><span class="s2">&quot;aot&quot;</span><span class="p">)</span>
+
+<span class="k">if</span> <a href="https://docs.python.org/3/library/functions.html#bool" title="builtins.bool" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">use_physical_hw</span></a><span class="p">:</span>
+    <span class="n">boards_file</span> <span class="o">=</span> <a href="https://docs.python.org/3/library/pathlib.html#pathlib.Path" title="pathlib.Path" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-py-class"><span class="n">pathlib</span><span class="o">.</span><span class="n">Path</span></a><span class="p">(</span><a href="../../reference/api/python/micro.html#tvm.micro.get_microtvm_template_projects" title="tvm.micro.get_microtvm_template_projects" class="sphx-glr-bac [...]
+    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">boards_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
+        <span class="n">boards</span> <span class="o">=</span> <a href="https://docs.python.org/3/library/json.html#json.load" title="json.load" class="sphx-glr-backref-module-json sphx-glr-backref-type-py-function"><span class="n">json</span><span class="o">.</span><span class="n">load</span></a><span class="p">(</span><span class="n">f</span><span class="p">)</span>
+    <span class="n">BOARD</span> <span class="o">=</span> <a href="https://docs.python.org/3/library/os.html#os.getenv" title="os.getenv" class="sphx-glr-backref-module-os sphx-glr-backref-type-py-function"><span class="n">os</span><span class="o">.</span><span class="n">getenv</span></a><span class="p">(</span><span class="s2">&quot;TVM_MICRO_BOARD&quot;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s2">&quot;nucleo_l4r5zi&quot;</span> [...]
+    <a href="../../reference/api/python/target.html#tvm.target.Target" title="tvm.target.Target" class="sphx-glr-backref-module-tvm-target sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">TARGET</span></a> <span class="o">=</span> <span class="n">tvm</span><span class="o">.</span><span class="n">target</span><span class="o">.</span><span class="n">target</span><span class="o">.</span><span class="n">micro</span><span class="p">(</span><span class="n">boards</span [...]
+</pre></div>
+</div>
+</div>
+<div class="section" id="compile-the-model">
+<h2>Compile the model<a class="headerlink" href="#compile-the-model" title="Permalink to this headline">¶</a></h2>
+<p>Now, we compile the model for the target:</p>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">with</span> <a href="../../reference/api/python/ir.html#tvm.transform.PassContext" title="tvm.transform.PassContext" class="sphx-glr-backref-module-tvm-transform sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">tvm</span><span class="o">.</span><span class="n">transform</span><span class="o">.</span><span class="n">PassContext</span></a><span class="p">(</span><spa [...]
+    <span class="n">module</span> <span class="o">=</span> <span class="n">tvm</span><span class="o">.</span><span class="n">relay</span><span class="o">.</span><span class="n">build</span><span class="p">(</span>
+        <span class="n">relay_mod</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><a href="../../reference/api/python/target.html#tvm.target.Target" title="tvm.target.Target" class="sphx-glr-backref-module-tvm-target sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">TARGET</span></a><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtin [...]
+    <span class="p">)</span>
+</pre></div>
+</div>
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
+  &quot;target_host parameter is going to be deprecated. &quot;
+</pre></div>
+</div>
+</div>
+<div class="section" id="create-a-microtvm-project">
+<h2>Create a microTVM project<a class="headerlink" href="#create-a-microtvm-project" title="Permalink to this headline">¶</a></h2>
+<p>Now that we have the compiled model as an IRModule, we need to create a firmware project
+to use the compiled model with microTVM. To do this, we use the Project API. We have defined
+CRT and Zephyr microTVM template projects, which are used for x86 CPU and Zephyr boards,
+respectively.</p>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><a href="https://docs.python.org/3/library/pathlib.html#pathlib.PosixPath" title="pathlib.PosixPath" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">template_project_path</span></a> <span class="o">=</span> <a href="https://docs.python.org/3/library/pathlib.html#pathlib.Path" title="pathlib.Path" class="sphx-glr-backref-module-pathlib sphx-g [...]
+<a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">project_options</span></a> <span class="o">=</span> <span class="p">{}</span>  <span class="c1"># You can use options to provide platform-specific options through TVM.</span>
+
+<span class="k">if</span> <a href="https://docs.python.org/3/library/functions.html#bool" title="builtins.bool" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">use_physical_hw</span></a><span class="p">:</span>
+    <a href="https://docs.python.org/3/library/pathlib.html#pathlib.PosixPath" title="pathlib.PosixPath" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">template_project_path</span></a> <span class="o">=</span> <a href="https://docs.python.org/3/library/pathlib.html#pathlib.Path" title="pathlib.Path" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-py-class"><span class="n">pathlib</span><span class="o">.</span> [...]
+    <a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">project_options</span></a> <span class="o">=</span> <span class="p">{</span><span class="s2">&quot;project_type&quot;</span><span class="p">:</span> <span class="s2">&quot;host_driven&quot;</span><span class="p">,</span> <span class="s2">&quot;zephyr_board&quot;</span><span class="p">:</s [...]
+
+<a href="../../reference/api/python/contrib.html#tvm.contrib.utils.TempDirectory" title="tvm.contrib.utils.TempDirectory" class="sphx-glr-backref-module-tvm-contrib-utils sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">temp_dir</span></a> <span class="o">=</span> <a href="../../reference/api/python/contrib.html#tvm.contrib.utils.tempdir" title="tvm.contrib.utils.tempdir" class="sphx-glr-backref-module-tvm-contrib-utils sphx-glr-backref-type-py-function"><span cl [...]
+<a href="https://docs.python.org/3/library/pathlib.html#pathlib.PosixPath" title="pathlib.PosixPath" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">generated_project_dir</span></a> <span class="o">=</span> <a href="../../reference/api/python/contrib.html#tvm.contrib.utils.TempDirectory" title="tvm.contrib.utils.TempDirectory" class="sphx-glr-backref-module-tvm-contrib-utils sphx-glr-backref-type-py-class sphx-glr-backref-i [...]
+<a href="../../reference/api/python/micro.html#tvm.micro.GeneratedProject" title="tvm.micro.GeneratedProject" class="sphx-glr-backref-module-tvm-micro sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">project</span></a> <span class="o">=</span> <a href="../../reference/api/python/micro.html#tvm.micro.generate_project" title="tvm.micro.generate_project" class="sphx-glr-backref-module-tvm-micro sphx-glr-backref-type-py-function"><span class="n">tvm</span><span class [...]
+    <a href="https://docs.python.org/3/library/pathlib.html#pathlib.PosixPath" title="pathlib.PosixPath" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">template_project_path</span></a><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <a href="https://docs.python.org/3/library/pathlib.html#pathlib.PosixPath" title="pathlib.PosixPath" class="sphx-glr-backref-module-pathlib sphx-glr-backref-type-p [...]
+<span class="p">)</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="build-flash-and-execute-the-model">
+<h2>Build, flash and execute the model<a class="headerlink" href="#build-flash-and-execute-the-model" title="Permalink to this headline">¶</a></h2>
+<p>Next, we build the microTVM project and flash it. The flash step is specific to
+physical microcontrollers; it is skipped when simulating a microcontroller
+via the host main.cc or when a Zephyr emulated board is selected as the target.
+We then define the labels for the model output and execute the model on a
+sample whose expected output is index 6 (label: left).</p>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><a href="../../reference/api/python/micro.html#tvm.micro.GeneratedProject" title="tvm.micro.GeneratedProject" class="sphx-glr-backref-module-tvm-micro sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">project</span></a><span class="o">.</span><span class="n">build</span><span class="p">()</span>
+<a href="../../reference/api/python/micro.html#tvm.micro.GeneratedProject" title="tvm.micro.GeneratedProject" class="sphx-glr-backref-module-tvm-micro sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">project</span></a><span class="o">.</span><span class="n">flash</span><span class="p">()</span>
+
+<a href="https://docs.python.org/3/library/stdtypes.html#list" title="builtins.list" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">labels</span></a> <span class="o">=</span> <span class="p">[</span>
+    <span class="s2">&quot;_silence_&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;_unknown_&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;yes&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;no&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;up&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;down&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;left&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;right&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;on&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;off&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;stop&quot;</span><span class="p">,</span>
+    <span class="s2">&quot;go&quot;</span><span class="p">,</span>
+<span class="p">]</span>
+<span class="k">with</span> <a href="../../reference/api/python/micro.html#tvm.micro.Session" title="tvm.micro.Session" class="sphx-glr-backref-module-tvm-micro sphx-glr-backref-type-py-class"><span class="n">tvm</span><span class="o">.</span><span class="n">micro</span><span class="o">.</span><span class="n">Session</span></a><span class="p">(</span><a href="../../reference/api/python/micro.html#tvm.micro.GeneratedProject" title="tvm.micro.GeneratedProject" class="sphx-glr-backref-modul [...]
+    <span class="n">aot_executor</span> <span class="o">=</span> <span class="n">tvm</span><span class="o">.</span><span class="n">runtime</span><span class="o">.</span><span class="n">executor</span><span class="o">.</span><span class="n">aot_executor</span><span class="o">.</span><span class="n">AotModule</span><span class="p">(</span><a href="../../reference/api/python/micro.html#tvm.micro.Session" title="tvm.micro.Session" class="sphx-glr-backref-module-tvm-micro sphx-glr-backref-typ [...]
+    <span class="n">sample</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">SAMPLE_PATH</span></a><span class="p">)</span>
+    <span class="n">aot_executor</span><span class="o">.</span><span class="n">get_input</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">INPUT_NAME</span></a><span class="p">)</span><span class="o">.</span><span class="n">copyfrom</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
+    <span class="n">aot_executor</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
+    <span class="n">result</span> <span class="o">=</span> <span class="n">aot_executor</span><span class="o">.</span><span class="n">get_output</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
+    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Label is `</span><span class="si">{</span><a href="https://docs.python.org/3/library/stdtypes.html#list" title="builtins.list" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">labels</span></a><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><sp [...]
+<span class="c1">#</span>
+<span class="c1"># Output:</span>
+<span class="c1"># Label is `left` with index `6`</span>
+<span class="c1">#</span>
+</pre></div>
+</div>
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Label is `left` with index `6`
+</pre></div>
+</div>
+<div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-aot-py">
+<div class="sphx-glr-download sphx-glr-download-python docutils container">
+<p><a class="reference download internal" download="" href="../../_downloads/f8a7209a0e66b246185bfc41bbc82f54/micro_aot.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">micro_aot.py</span></code></a></p>
+</div>
+<div class="sphx-glr-download sphx-glr-download-jupyter docutils container">
+<p><a class="reference download internal" download="" href="../../_downloads/c00933f3fbcf90c4f584d54607b33805/micro_aot.ipynb"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Jupyter</span> <span class="pre">notebook:</span> <span class="pre">micro_aot.ipynb</span></code></a></p>
+</div>
+</div>
+<p class="sphx-glr-signature"><a class="reference external" href="https://sphinx-gallery.github.io">Gallery generated by Sphinx-Gallery</a></p>
+</div>
+</div>
+
+
+           </div>
+           
+          </div>
+          
+
+<footer>
+
+    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
+      
+        <a href="micro_autotune.html" class="btn btn-neutral float-right" title="Autotuning with microTVM" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
+      
+      
+        <a href="index.html" class="btn btn-neutral float-left" title="Work With microTVM" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
+      
+    </div>
+
+<div id="button" class="backtop"><img src="../../_static/img/right.svg" alt="backtop"/> </div>
+<section class="footerSec">
+    <div class="footerHeader">
+      <div class="d-flex align-md-items-center justify-content-between flex-column flex-md-row">
+        <div class="copywrite d-flex align-items-center">
+          <h5 id="copy-right-info">© 2022 Apache Software Foundation | All rights reserved</h5>
+        </div>
+      </div>
+
+    </div>
+
+    <div>
+      <div class="footernote">Copyright © 2022 The Apache Software Foundation. Apache TVM, Apache, the Apache feather, and the Apache TVM project logo are either trademarks or registered trademarks of the Apache Software Foundation.</div>
+    </div>
+
+</section>
+</footer>
+        </div>
+      </div>
+
+    </section>
+
+  </div>
+  
+
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js" integrity="sha384-ApNbgh9B+Y1QKtv3Rn7W3mgPxhU9K/ScQsAP7hUibX39j7fakFPskvXusvfa0b4Q" crossorigin="anonymous"></script>
+    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js" integrity="sha384-JZR6Spejh4U02d8jOt6vLEHfe/JQGiRRSQQxSfFWpi1MquVdAyjUar5+76PVCmYl" crossorigin="anonymous"></script>
+
+  </body>
+  <script type="text/javascript">
+      jQuery(function () {
+          SphinxRtdTheme.Navigation.enable(true);
+      });
+  </script>
+
+  
+  
+    
+    <!-- Theme Analytics -->
+    <script>
+    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
+
+    ga('create', 'UA-75982049-2', 'auto');
+    ga('send', 'pageview');
+    </script>
+
+    
+   
+
+</body>
+</html>
\ No newline at end of file
diff --git a/docs/how_to/work_with_microtvm/micro_autotune.html b/docs/how_to/work_with_microtvm/micro_autotune.html
index 6c01bae4d..5be958874 100644
--- a/docs/how_to/work_with_microtvm/micro_autotune.html
+++ b/docs/how_to/work_with_microtvm/micro_autotune.html
@@ -49,7 +49,7 @@
     <link rel="index" title="Index" href="../../genindex.html" />
     <link rel="search" title="Search" href="../../search.html" />
     <link rel="next" title="Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN" href="micro_ethosu.html" />
-    <link rel="prev" title="Work With microTVM" href="index.html" /> 
+    <link rel="prev" title="microTVM Host-Driven AoT" href="micro_aot.html" /> 
 </head>
 
 <body class="wy-body-for-nav">
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3 current"><a class="current reference internal" href="#">Autotuning with microTVM</a><ul>
 <li class="toctree-l4"><a class="reference internal" href="#defining-the-model">Defining the model</a></li>
 <li class="toctree-l4"><a class="reference internal" href="#defining-the-target">Defining the target</a></li>
@@ -583,10 +584,10 @@ the tuned operator.</p>
 ########## Build without Autotuning ##########
 Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)
 ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  311.2     98.678   (1, 2, 10, 10, 3)  2       1        [311.2]
-tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.056     0.969    (1, 6, 10, 10)     1       1        [3.056]
-tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         1.112     0.353    (1, 1, 10, 10, 3)  1       1        [1.112]
-Total_time                                    -                                             315.369   -        -                  -       -        -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  311.1     98.724   (1, 2, 10, 10, 3)  2       1        [311.1]
+tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.049     0.967    (1, 6, 10, 10)     1       1        [3.049]
+tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.972     0.308    (1, 1, 10, 10, 3)  1       1        [0.972]
+Total_time                                    -                                             315.12    -        -                  -       -        -
 </pre></div>
 </div>
 </div>
@@ -639,10 +640,10 @@ Total_time                                    -
 ########## Build with Autotuning ##########
 Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)
 ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------
-tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  249.2     98.828   (1, 1, 10, 10, 6)  2       1        [249.2]
-tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.987     0.788    (1, 6, 10, 10)     1       1        [1.987]
-tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.969     0.384    (1, 1, 10, 10, 3)  1       1        [0.969]
-Total_time                                    -                                             252.156   -        -                  -       -        -
+tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  149.8     98.167   (1, 6, 10, 10, 1)  2       1        [149.8]
+tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.815     1.189    (1, 6, 10, 10)     1       1        [1.815]
+tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.982     0.644    (1, 1, 10, 10, 3)  1       1        [0.982]
+Total_time                                    -                                             152.597   -        -                  -       -        -
 </pre></div>
 </div>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-autotune-py">
@@ -670,7 +671,7 @@ Total_time                                    -
         <a href="micro_ethosu.html" class="btn btn-neutral float-right" title="Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
       
       
-        <a href="index.html" class="btn btn-neutral float-left" title="Work With microTVM" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
+        <a href="micro_aot.html" class="btn btn-neutral float-left" title="microTVM Host-Driven AoT" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
       
     </div>
 
diff --git a/docs/how_to/work_with_microtvm/micro_ethosu.html b/docs/how_to/work_with_microtvm/micro_ethosu.html
index e0caf7712..7e947531b 100644
--- a/docs/how_to/work_with_microtvm/micro_ethosu.html
+++ b/docs/how_to/work_with_microtvm/micro_ethosu.html
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
 <li class="toctree-l3 current"><a class="current reference internal" href="#">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a><ul>
 <li class="toctree-l4"><a class="reference internal" href="#obtaining-tvm">Obtaining TVM</a></li>
diff --git a/docs/how_to/work_with_microtvm/micro_reference_vm.html b/docs/how_to/work_with_microtvm/micro_reference_vm.html
index ef3de312d..bd375d1e5 100644
--- a/docs/how_to/work_with_microtvm/micro_reference_vm.html
+++ b/docs/how_to/work_with_microtvm/micro_reference_vm.html
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_ethosu.html">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a></li>
 <li class="toctree-l3 current"><a class="current reference internal" href="#">microTVM Reference Virtual Machines</a><ul>
diff --git a/docs/how_to/work_with_microtvm/micro_tflite.html b/docs/how_to/work_with_microtvm/micro_tflite.html
index f0bfeeca0..59a7103e8 100644
--- a/docs/how_to/work_with_microtvm/micro_tflite.html
+++ b/docs/how_to/work_with_microtvm/micro_tflite.html
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_ethosu.html">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_reference_vm.html">microTVM Reference Virtual Machines</a></li>
diff --git a/docs/how_to/work_with_microtvm/micro_train.html b/docs/how_to/work_with_microtvm/micro_train.html
index 4839adad4..3a77f461d 100644
--- a/docs/how_to/work_with_microtvm/micro_train.html
+++ b/docs/how_to/work_with_microtvm/micro_train.html
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_ethosu.html">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_reference_vm.html">microTVM Reference Virtual Machines</a></li>
@@ -515,7 +516,7 @@ take about <strong>2 minutes</strong> to download the Stanford Cars, while COCO
 <a href="https://docs.python.org/3/library/shutil.html#shutil.move" title="shutil.move" class="sphx-glr-backref-module-shutil sphx-glr-backref-type-py-function"><span class="n">shutil</span><span class="o">.</span><span class="n">move</span></a><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;</span><span class="si">{</span><a href="https://docs.python.org/3/library/stdtypes.html#str" title="builtins.str" class="sphx-glr-backref-module-builtins sphx-glr-backref-typ [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&#39;/tmp/tmp56hx7b7g/images/random&#39;
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&#39;/tmp/tmpkmqjaw7b/images/random&#39;
 </pre></div>
 </div>
 </div>
@@ -575,8 +576,8 @@ objects to other stuff? We can display some examples from our datasets using <co
     <span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s2">&quot;off&quot;</span><span class="p">)</span>
 </pre></div>
 </div>
-<img src="../../_images/sphx_glr_micro_train_001.png" srcset="../../_images/sphx_glr_micro_train_001.png" alt="[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/tmp/tmp56hx7b7g/images/target contains 8144 images
-/tmp/tmp56hx7b7g/images/random contains 5000 images
+<img src="../../_images/sphx_glr_micro_train_001.png" srcset="../../_images/sphx_glr_micro_train_001.png" alt="[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]" class = "sphx-glr-single-img"/><div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/tmp/tmpkmqjaw7b/images/target contains 8144 images
+/tmp/tmpkmqjaw7b/images/random contains 5000 images
 </pre></div>
 </div>
 </div>
@@ -688,13 +689,13 @@ the time on our validation set).</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Epoch 1/3
-328/328 - 55s - loss: 0.2052 - accuracy: 0.9269 - val_loss: 0.1312 - val_accuracy: 0.9581
+328/328 - 56s - loss: 0.2187 - accuracy: 0.9223 - val_loss: 0.1407 - val_accuracy: 0.9581
 Epoch 2/3
-328/328 - 52s - loss: 0.0962 - accuracy: 0.9640 - val_loss: 0.1109 - val_accuracy: 0.9637
+328/328 - 53s - loss: 0.0970 - accuracy: 0.9618 - val_loss: 0.1128 - val_accuracy: 0.9637
 Epoch 3/3
-328/328 - 52s - loss: 0.0615 - accuracy: 0.9768 - val_loss: 0.1065 - val_accuracy: 0.9645
+328/328 - 52s - loss: 0.0658 - accuracy: 0.9754 - val_loss: 0.0967 - val_accuracy: 0.9679
 
-&lt;keras.callbacks.History object at 0x7f76f644c450&gt;
+&lt;keras.callbacks.History object at 0x7f86f8594d90&gt;
 </pre></div>
 </div>
 </div>
@@ -956,7 +957,7 @@ as intended.</p>
 <p>From here, we could modify the model to read live images from the camera - we have another
 Arduino tutorial for how to do that <a class="reference external" href="https://github.com/guberti/tvm-arduino-demos/tree/master/examples/person_detection">on GitHub</a>. Alternatively, we could also
 <a class="reference external" href="https://tvm.apache.org/docs/how_to/work_with_microtvm/micro_autotune.html">use TVM’s autotuning capabilities</a> to dramatically improve the model’s performance.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 5 minutes  8.568 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 5 minutes  10.753 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-how-to-work-with-microtvm-micro-train-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/b52cec46baf4f78d6bcd94cbe269c8a6/micro_train.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">micro_train.py</span></code></a></p>
diff --git a/docs/how_to/work_with_microtvm/micro_tvmc.html b/docs/how_to/work_with_microtvm/micro_tvmc.html
index e44310456..7497074ba 100644
--- a/docs/how_to/work_with_microtvm/micro_tvmc.html
+++ b/docs/how_to/work_with_microtvm/micro_tvmc.html
@@ -238,6 +238,7 @@
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autotvm/index.html">Auto-Tune with Templates and AutoTVM</a></li>
 <li class="toctree-l2"><a class="reference internal" href="../tune_with_autoscheduler/index.html">Use AutoScheduler for Template-Free Scheduling</a></li>
 <li class="toctree-l2 current"><a class="reference internal" href="index.html">Work With microTVM</a><ul class="current">
+<li class="toctree-l3"><a class="reference internal" href="micro_aot.html">microTVM Host-Driven AoT</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_autotune.html">Autotuning with microTVM</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_ethosu.html">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</a></li>
 <li class="toctree-l3"><a class="reference internal" href="micro_reference_vm.html">microTVM Reference Virtual Machines</a></li>
diff --git a/docs/how_to/work_with_microtvm/sg_execution_times.html b/docs/how_to/work_with_microtvm/sg_execution_times.html
index 58c45c1c4..d312e4b20 100644
--- a/docs/how_to/work_with_microtvm/sg_execution_times.html
+++ b/docs/how_to/work_with_microtvm/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-microtvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>05:53.166</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
+<p><strong>06:05.900</strong> total execution time for <strong>how_to_work_with_microtvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -336,26 +336,30 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="micro_train.html#sphx-glr-how-to-work-with-microtvm-micro-train-py"><span class="std std-ref">Training Vision Models for microTVM on Arduino</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_train.py</span></code>)</p></td>
-<td><p>05:08.568</p></td>
+<td><p>05:10.753</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="micro_autotune.html#sphx-glr-how-to-work-with-microtvm-micro-autotune-py"><span class="std std-ref">Autotuning with microTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_autotune.py</span></code>)</p></td>
-<td><p>00:41.466</p></td>
+<td><p>00:43.431</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></td>
-<td><p>00:03.131</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="micro_aot.html#sphx-glr-how-to-work-with-microtvm-micro-aot-py"><span class="std std-ref">microTVM Host-Driven AoT</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_aot.py</span></code>)</p></td>
+<td><p>00:08.269</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="micro_tflite.html#sphx-glr-how-to-work-with-microtvm-micro-tflite-py"><span class="std std-ref">microTVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></td>
+<td><p>00:03.446</p></td>
+<td><p>0.0 MB</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="micro_ethosu.html#sphx-glr-how-to-work-with-microtvm-micro-ethosu-py"><span class="std std-ref">Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_ethosu.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="micro_reference_vm.html#sphx-glr-how-to-work-with-microtvm-micro-reference-vm-py"><span class="std std-ref">microTVM Reference Virtual Machines</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_reference_vm.py</span></code>)</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="micro_reference_vm.html#sphx-glr-how-to-work-with-microtvm-micro-reference-vm-py"><span class="std std-ref">microTVM Reference Virtual Machines</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_reference_vm.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="micro_tvmc.html#sphx-glr-how-to-work-with-microtvm-micro-tvmc-py"><span class="std std-ref">Executing a Tiny Model with TVMC Micro</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tvmc.py</span></code>)</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="micro_tvmc.html#sphx-glr-how-to-work-with-microtvm-micro-tvmc-py"><span class="std std-ref">Executing a Tiny Model with TVMC Micro</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tvmc.py</span></code>)</p></td>
 <td><p>00:00.000</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
diff --git a/docs/how_to/work_with_relay/sg_execution_times.html b/docs/how_to/work_with_relay/sg_execution_times.html
index 1b62ad3ed..139213503 100644
--- a/docs/how_to/work_with_relay/sg_execution_times.html
+++ b/docs/how_to/work_with_relay/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-relay-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:37.909</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
+<p><strong>00:38.828</strong> total execution time for <strong>how_to_work_with_relay</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -336,15 +336,15 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="using_pipeline_executor.html#sphx-glr-how-to-work-with-relay-using-pipeline-executor-py"><span class="std std-ref">Using Pipeline Executor in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_pipeline_executor.py</span></code>)</p></td>
-<td><p>00:29.025</p></td>
+<td><p>00:31.358</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="using_external_lib.html#sphx-glr-how-to-work-with-relay-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></td>
-<td><p>00:07.643</p></td>
+<td><p>00:05.886</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="build_gcn.html#sphx-glr-how-to-work-with-relay-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></td>
-<td><p>00:01.234</p></td>
+<td><p>00:01.577</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="using_relay_viz.html#sphx-glr-how-to-work-with-relay-using-relay-viz-py"><span class="std std-ref">Use Relay Visualizer to Visualize Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_relay_viz.py</span></code>)</p></td>
diff --git a/docs/how_to/work_with_schedules/intrin_math.html b/docs/how_to/work_with_schedules/intrin_math.html
index 6208e2594..d888a88d0 100644
--- a/docs/how_to/work_with_schedules/intrin_math.html
+++ b/docs/how_to/work_with_schedules/intrin_math.html
@@ -522,7 +522,7 @@ The following example customizes CUDA lowering rule for <code class="code docuti
 <a href="../../reference/api/python/ir.html#tvm.ir.register_intrin_lowering" title="tvm.ir.register_intrin_lowering" class="sphx-glr-backref-module-tvm-ir sphx-glr-backref-type-py-function"><span class="n">register_intrin_lowering</span></a><span class="p">(</span><span class="s2">&quot;tir.exp&quot;</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="s2">&quot;cuda&quot;</span><span class="p">,</span> <span class="n">f</span><span class="o">= [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&lt;function my_cuda_math_rule at 0x7f76f63c03b0&gt;
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>&lt;function my_cuda_math_rule at 0x7f8679293f80&gt;
 </pre></div>
 </div>
 <p>Register the rule to TVM with override option to override existing rule.
diff --git a/docs/how_to/work_with_schedules/sg_execution_times.html b/docs/how_to/work_with_schedules/sg_execution_times.html
index 08c5ba8cd..e488d076b 100644
--- a/docs/how_to/work_with_schedules/sg_execution_times.html
+++ b/docs/how_to/work_with_schedules/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-how-to-work-with-schedules-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:03.756</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
+<p><strong>00:04.455</strong> total execution time for <strong>how_to_work_with_schedules</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -336,35 +336,35 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="intrin_math.html#sphx-glr-how-to-work-with-schedules-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></td>
-<td><p>00:01.711</p></td>
+<td><p>00:01.998</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tensorize.html#sphx-glr-how-to-work-with-schedules-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></td>
-<td><p>00:00.916</p></td>
+<td><p>00:01.175</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="reduction.html#sphx-glr-how-to-work-with-schedules-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></td>
-<td><p>00:00.484</p></td>
+<td><p>00:00.557</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="scan.html#sphx-glr-how-to-work-with-schedules-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></td>
-<td><p>00:00.468</p></td>
+<td><p>00:00.541</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="extern_op.html#sphx-glr-how-to-work-with-schedules-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></td>
-<td><p>00:00.097</p></td>
+<td><p>00:00.102</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="schedule_primitives.html#sphx-glr-how-to-work-with-schedules-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></td>
-<td><p>00:00.040</p></td>
+<td><p>00:00.041</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tedd.html#sphx-glr-how-to-work-with-schedules-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></td>
-<td><p>00:00.026</p></td>
+<td><p>00:00.027</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tuple_inputs.html#sphx-glr-how-to-work-with-schedules-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></td>
-<td><p>00:00.014</p></td>
+<td><p>00:00.015</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/how_to/work_with_schedules/tensorize.html b/docs/how_to/work_with_schedules/tensorize.html
index e9b240d81..1b9994d56 100644
--- a/docs/how_to/work_with_schedules/tensorize.html
+++ b/docs/how_to/work_with_schedules/tensorize.html
@@ -577,7 +577,7 @@ The importing needs to happen before the tensorized GEMV being executed.</p>
              C: Buffer(C_2: Pointer(float32), float32, [524288], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C}
   preflattened_buffer_map = {A_1: A_3: Buffer(A_2, float32, [1024, 64], []), B_1: B_3: Buffer(B_2, float32, [512, 64], []), C_1: C_3: Buffer(C_2, float32, [1024, 512], [])} {
-  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmp2_2mr7z0/input0.cc&#39;\nsource_filename = \&quot;/tmp/tmp2_2mr7z0/input0.cc\&quot;\ntarget datalayout = \&quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128\&quot;\ntarget triple = \&quot;x86_64-pc-linux-gnu\&quot;\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = allo [...]
+  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmphxqkmtky/input0.cc&#39;\nsource_filename = \&quot;/tmp/tmphxqkmtky/input0.cc\&quot;\ntarget datalayout = \&quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128\&quot;\ntarget triple = \&quot;x86_64-pc-linux-gnu\&quot;\n\n; Function Attrs: noinline nounwind optnone uwtable\ndefine dso_local i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {\n  %7 = allo [...]
   for (i, 0, 1024) {
     for (j.outer: int32, 0, 32) {
       @tir.call_extern(&quot;gemv_update&quot;, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), C_2, ((i*512) + (j.outer*16)), 16, 2, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), A_2, (i*64), 64, 1, dtype=handle), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=float32), B_2, (j.outer*1024), 1024, 1, dtype=handle), 16, 64, 64, dtype=int32)
diff --git a/docs/objects.inv b/docs/objects.inv
index d2cd1dbde..7d85c98ab 100644
Binary files a/docs/objects.inv and b/docs/objects.inv differ
diff --git a/docs/reference/api/python/auto_scheduler.html b/docs/reference/api/python/auto_scheduler.html
index 6cd3f74b2..bca168c23 100644
--- a/docs/reference/api/python/auto_scheduler.html
+++ b/docs/reference/api/python/auto_scheduler.html
@@ -1602,7 +1602,7 @@ history states as starting point to perform Evolutionary Search).</p></li>
 
 <dl class="py class">
 <dt class="sig sig-object py" id="tvm.auto_scheduler.SketchPolicy">
-<em class="property"><span class="pre">class</span> </em><span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">SketchPolicy</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">program_cost_model</span></span><span class="o"><span class="pre">=</span></span><span class="defau [...]
+<em class="property"><span class="pre">class</span> </em><span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">SketchPolicy</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">program_cost_model</span></span><span class="o"><span class="pre">=</span></span><span class="defau [...]
 <dd><p>The search policy that searches in a hierarchical search space defined by sketches.
 The policy randomly samples programs from the space defined by sketches and use evolutionary
 search to fine-tune them.</p>
@@ -1886,7 +1886,7 @@ Candidates:
 
 <dl class="py function">
 <dt class="sig sig-object py" id="tvm.auto_scheduler.auto_schedule">
-<span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">auto_schedule</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">search_policy</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em clas [...]
+<span class="sig-prename descclassname"><span class="pre">tvm.auto_scheduler.</span></span><span class="sig-name descname"><span class="pre">auto_schedule</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">task</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">search_policy</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em clas [...]
 <dd><p>THIS API IS DEPRECATED.</p>
 <p>Run auto scheduling search for a task.</p>
 <dl class="field-list simple">
diff --git a/docs/reference/api/typedoc/classes/bytestreamreader.html b/docs/reference/api/typedoc/classes/bytestreamreader.html
index f29277670..73bb274ae 100644
--- a/docs/reference/api/typedoc/classes/bytestreamreader.html
+++ b/docs/reference/api/typedoc/classes/bytestreamreader.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -141,7 +141,7 @@
 					<div class="tsd-signature tsd-kind-icon">bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Uint8Array</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -151,7 +151,7 @@
 					<div class="tsd-signature tsd-kind-icon">offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -168,7 +168,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">Uint8Array</span></h4>
@@ -185,7 +185,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -202,7 +202,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/reference/api/typedoc/classes/cachedcallstack.html b/docs/reference/api/typedoc/classes/cachedcallstack.html
index 3a8742e09..3a7acf30e 100644
--- a/docs/reference/api/typedoc/classes/cachedcallstack.html
+++ b/docs/reference/api/typedoc/classes/cachedcallstack.html
@@ -144,7 +144,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L223">memory.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L223">memory.ts:223</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -172,7 +172,7 @@
 					<div class="tsd-signature tsd-kind-icon">temp<wbr>Args<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><a href="../interfaces/disposable.html" class="tsd-signature-type">Disposable</a><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L208">memory.ts:208</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L208">memory.ts:208</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -194,7 +194,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L312">memory.ts:312</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L312">memory.ts:312</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L284">memory.ts:284</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L284">memory.ts:284</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -262,7 +262,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L388">memory.ts:388</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L388">memory.ts:388</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -300,7 +300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L376">memory.ts:376</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L376">memory.ts:376</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -340,7 +340,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L267">memory.ts:267</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L267">memory.ts:267</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -373,7 +373,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L243">memory.ts:243</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L243">memory.ts:243</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -390,7 +390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L321">memory.ts:321</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L321">memory.ts:321</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -422,7 +422,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L252">memory.ts:252</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L252">memory.ts:252</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -444,7 +444,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L359">memory.ts:359</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L359">memory.ts:359</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -470,7 +470,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L342">memory.ts:342</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L342">memory.ts:342</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -496,7 +496,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L350">memory.ts:350</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L350">memory.ts:350</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -522,7 +522,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L326">memory.ts:326</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L326">memory.ts:326</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -548,7 +548,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L363">memory.ts:363</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L363">memory.ts:363</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -574,7 +574,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L346">memory.ts:346</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L346">memory.ts:346</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -600,7 +600,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L334">memory.ts:334</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L334">memory.ts:334</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
diff --git a/docs/reference/api/typedoc/classes/dldatatype.html b/docs/reference/api/typedoc/classes/dldatatype.html
index 96c8b41ec..c4d1bcf8f 100644
--- a/docs/reference/api/typedoc/classes/dldatatype.html
+++ b/docs/reference/api/typedoc/classes/dldatatype.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L262">runtime.ts:262</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L262">runtime.ts:262</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">bits<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L260">runtime.ts:260</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L260">runtime.ts:260</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">code<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L258">runtime.ts:258</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L258">runtime.ts:258</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -177,7 +177,7 @@
 					<div class="tsd-signature tsd-kind-icon">lanes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L262">runtime.ts:262</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L262">runtime.ts:262</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -199,7 +199,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L279">runtime.ts:279</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L279">runtime.ts:279</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -216,7 +216,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L270">runtime.ts:270</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L270">runtime.ts:270</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/reference/api/typedoc/classes/dldevice.html b/docs/reference/api/typedoc/classes/dldevice.html
index 887d2deca..5ca3966f1 100644
--- a/docs/reference/api/typedoc/classes/dldevice.html
+++ b/docs/reference/api/typedoc/classes/dldevice.html
@@ -118,7 +118,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L202">runtime.ts:202</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L202">runtime.ts:202</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L200">runtime.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L200">runtime.ts:200</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -161,7 +161,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L198">runtime.ts:198</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L198">runtime.ts:198</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -183,7 +183,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L223">runtime.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L223">runtime.ts:223</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -205,7 +205,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L230">runtime.ts:230</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L230">runtime.ts:230</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/reference/api/typedoc/classes/environment.html b/docs/reference/api/typedoc/classes/environment.html
index ef5b0bf95..559e4db2c 100644
--- a/docs/reference/api/typedoc/classes/environment.html
+++ b/docs/reference/api/typedoc/classes/environment.html
@@ -125,7 +125,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L86">environment.ts:86</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L86">environment.ts:86</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -169,7 +169,7 @@
 					<aside class="tsd-sources">
 						<p>Implementation of <a href="../interfaces/libraryprovider.html">LibraryProvider</a>.<a href="../interfaces/libraryprovider.html#imports">imports</a></p>
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L70">environment.ts:70</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L70">environment.ts:70</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L69">environment.ts:69</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L69">environment.ts:69</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -210,7 +210,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">ctypes.FTVMWasmPackedCFunc</span><span class="tsd-signature-symbol"> | </span><span class="tsd-signature-type">undefined</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = [undefined,]</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L78">environment.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L78">environment.ts:78</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -228,7 +228,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<wbr>Free<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L84">environment.ts:84</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L84">environment.ts:84</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -250,7 +250,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L105">environment.ts:105</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L105">environment.ts:105</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/ffilibrary.html b/docs/reference/api/typedoc/classes/ffilibrary.html
index e02480250..341ff2194 100644
--- a/docs/reference/api/typedoc/classes/ffilibrary.html
+++ b/docs/reference/api/typedoc/classes/ffilibrary.html
@@ -131,7 +131,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L49">runtime.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L49">runtime.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L46">runtime.ts:46</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L46">runtime.ts:46</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L45">runtime.ts:45</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L45">runtime.ts:45</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L44">runtime.ts:44</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L44">runtime.ts:44</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">webGPUContext<span class="tsd-signature-symbol">:</span> <a href="webgpucontext.html" class="tsd-signature-type">WebGPUContext</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L47">runtime.ts:47</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L47">runtime.ts:47</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -203,7 +203,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L76">runtime.ts:76</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L76">runtime.ts:76</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L66">runtime.ts:66</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L66">runtime.ts:66</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -243,7 +243,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L84">runtime.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L84">runtime.ts:84</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <a href="cachedcallstack.html" class="tsd-signature-type">CachedCallStack</a></h4>
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L95">runtime.ts:95</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L95">runtime.ts:95</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -283,7 +283,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L72">runtime.ts:72</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L72">runtime.ts:72</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/reference/api/typedoc/classes/graphexecutor.html b/docs/reference/api/typedoc/classes/graphexecutor.html
index 11e72b6b1..e455aacd2 100644
--- a/docs/reference/api/typedoc/classes/graphexecutor.html
+++ b/docs/reference/api/typedoc/classes/graphexecutor.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L583">runtime.ts:583</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L583">runtime.ts:583</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">module<span class="tsd-signature-symbol">:</span> <a href="module.html" class="tsd-signature-type">Module</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L579">runtime.ts:579</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L579">runtime.ts:579</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L654">runtime.ts:654</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L654">runtime.ts:654</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -224,7 +224,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L597">runtime.ts:597</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L597">runtime.ts:597</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -241,7 +241,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L631">runtime.ts:631</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L631">runtime.ts:631</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L644">runtime.ts:644</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L644">runtime.ts:644</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -310,7 +310,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L621">runtime.ts:621</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L621">runtime.ts:621</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -332,7 +332,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L609">runtime.ts:609</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L609">runtime.ts:609</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/instance.html b/docs/reference/api/typedoc/classes/instance.html
index 9e48def3c..a618dcd46 100644
--- a/docs/reference/api/typedoc/classes/instance.html
+++ b/docs/reference/api/typedoc/classes/instance.html
@@ -139,7 +139,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L692">runtime.ts:692</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L692">runtime.ts:692</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -202,7 +202,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L684">runtime.ts:684</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L684">runtime.ts:684</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -212,7 +212,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L683">runtime.ts:683</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L683">runtime.ts:683</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -229,7 +229,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L932">runtime.ts:932</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L932">runtime.ts:932</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L994">runtime.ts:994</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L994">runtime.ts:994</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -303,7 +303,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L924">runtime.ts:924</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L924">runtime.ts:924</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -341,7 +341,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L732">runtime.ts:732</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L732">runtime.ts:732</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -358,7 +358,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L952">runtime.ts:952</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L952">runtime.ts:952</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -402,7 +402,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L816">runtime.ts:816</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L816">runtime.ts:816</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -434,7 +434,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L1033">runtime.ts:1033</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L1033">runtime.ts:1033</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -465,7 +465,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L846">runtime.ts:846</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L846">runtime.ts:846</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -497,7 +497,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L750">runtime.ts:750</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L750">runtime.ts:750</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -520,7 +520,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L1013">runtime.ts:1013</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L1013">runtime.ts:1013</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -568,7 +568,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L789">runtime.ts:789</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L789">runtime.ts:789</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -608,7 +608,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L914">runtime.ts:914</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L914">runtime.ts:914</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -646,7 +646,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L1140">runtime.ts:1140</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L1140">runtime.ts:1140</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -698,7 +698,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L740">runtime.ts:740</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L740">runtime.ts:740</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -722,7 +722,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L868">runtime.ts:868</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L868">runtime.ts:868</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -754,7 +754,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L857">runtime.ts:857</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L857">runtime.ts:857</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -786,7 +786,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L940">runtime.ts:940</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L940">runtime.ts:940</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/memory.html b/docs/reference/api/typedoc/classes/memory.html
index 826be23c7..cbc42849b 100644
--- a/docs/reference/api/typedoc/classes/memory.html
+++ b/docs/reference/api/typedoc/classes/memory.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L40">memory.ts:40</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L40">memory.ts:40</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Memory</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L32">memory.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L32">memory.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span><span class="tsd-signature-symbol"> = true</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L33">memory.ts:33</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L33">memory.ts:33</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L154">memory.ts:154</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L154">memory.ts:154</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -210,7 +210,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L90">memory.ts:90</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L90">memory.ts:90</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -233,7 +233,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L97">memory.ts:97</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L97">memory.ts:97</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -256,7 +256,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L74">memory.ts:74</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L74">memory.ts:74</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L81">memory.ts:81</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L81">memory.ts:81</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -302,7 +302,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L104">memory.ts:104</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L104">memory.ts:104</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -325,7 +325,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L132">memory.ts:132</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L132">memory.ts:132</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -362,7 +362,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L145">memory.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L145">memory.ts:145</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -393,7 +393,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L60">memory.ts:60</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L60">memory.ts:60</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -416,7 +416,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L67">memory.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L67">memory.ts:67</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -439,7 +439,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L53">memory.ts:53</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L53">memory.ts:53</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -462,7 +462,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L114">memory.ts:114</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L114">memory.ts:114</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -485,7 +485,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L124">memory.ts:124</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L124">memory.ts:124</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -502,7 +502,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/memory.ts#L175">memory.ts:175</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/memory.ts#L175">memory.ts:175</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/module.html b/docs/reference/api/typedoc/classes/module.html
index dc7226d05..b1cee0b8a 100644
--- a/docs/reference/api/typedoc/classes/module.html
+++ b/docs/reference/api/typedoc/classes/module.html
@@ -124,7 +124,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L504">runtime.ts:504</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L504">runtime.ts:504</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L502">runtime.ts:502</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L502">runtime.ts:502</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -187,7 +187,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L516">runtime.ts:516</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L516">runtime.ts:516</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -204,7 +204,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L530">runtime.ts:530</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L530">runtime.ts:530</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -236,7 +236,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L561">runtime.ts:561</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L561">runtime.ts:561</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/ndarray.html b/docs/reference/api/typedoc/classes/ndarray.html
index 30436a3ac..607325d5e 100644
--- a/docs/reference/api/typedoc/classes/ndarray.html
+++ b/docs/reference/api/typedoc/classes/ndarray.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L304">runtime.ts:304</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L304">runtime.ts:304</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -158,7 +158,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <a href="dldevice.html" class="tsd-signature-type">DLDevice</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L297">runtime.ts:297</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L297">runtime.ts:297</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -173,7 +173,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L293">runtime.ts:293</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L293">runtime.ts:293</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -188,7 +188,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L289">runtime.ts:289</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L289">runtime.ts:289</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -203,7 +203,7 @@
 					<div class="tsd-signature tsd-kind-icon">ndim<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L291">runtime.ts:291</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L291">runtime.ts:291</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -218,7 +218,7 @@
 					<div class="tsd-signature tsd-kind-icon">shape<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L295">runtime.ts:295</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L295">runtime.ts:295</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -240,7 +240,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L370">runtime.ts:370</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L370">runtime.ts:370</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -273,7 +273,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L414">runtime.ts:414</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L414">runtime.ts:414</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -305,7 +305,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L355">runtime.ts:355</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L355">runtime.ts:355</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -322,7 +322,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L474">runtime.ts:474</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L474">runtime.ts:474</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -346,7 +346,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L443">runtime.ts:443</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L443">runtime.ts:443</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/packedfunccell.html b/docs/reference/api/typedoc/classes/packedfunccell.html
index 473ca8edf..2e55173f1 100644
--- a/docs/reference/api/typedoc/classes/packedfunccell.html
+++ b/docs/reference/api/typedoc/classes/packedfunccell.html
@@ -122,7 +122,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L158">runtime.ts:158</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L158">runtime.ts:158</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L157">runtime.ts:157</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L157">runtime.ts:157</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -164,7 +164,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L165">runtime.ts:165</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L165">runtime.ts:165</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
diff --git a/docs/reference/api/typedoc/classes/rpcserver.html b/docs/reference/api/typedoc/classes/rpcserver.html
index 9f5f14a21..5ce012d5f 100644
--- a/docs/reference/api/typedoc/classes/rpcserver.html
+++ b/docs/reference/api/typedoc/classes/rpcserver.html
@@ -115,7 +115,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">get<wbr>Imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">unknown</span><span class="tsd-signat [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -201,7 +201,7 @@
 					<div class="tsd-signature tsd-kind-icon">key<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -211,7 +211,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -242,7 +242,7 @@
 					<div class="tsd-signature tsd-kind-icon">socket<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">WebSocket</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -252,7 +252,7 @@
 					<div class="tsd-signature tsd-kind-icon">state<span class="tsd-signature-symbol">:</span> <a href="../enums/rpcserverstate.html" class="tsd-signature-type">RPCServerState</a><span class="tsd-signature-symbol"> = RPCServerState.InitHeader</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -262,7 +262,7 @@
 					<div class="tsd-signature tsd-kind-icon">url<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/classes/scalar.html b/docs/reference/api/typedoc/classes/scalar.html
index aa5996042..063561abc 100644
--- a/docs/reference/api/typedoc/classes/scalar.html
+++ b/docs/reference/api/typedoc/classes/scalar.html
@@ -112,7 +112,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -137,7 +137,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">value<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L143">runtime.ts:143</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L143">runtime.ts:143</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/classes/webgpucontext.html b/docs/reference/api/typedoc/classes/webgpucontext.html
index cdf736507..7c746261a 100644
--- a/docs/reference/api/typedoc/classes/webgpucontext.html
+++ b/docs/reference/api/typedoc/classes/webgpucontext.html
@@ -120,7 +120,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -145,7 +145,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">GPUDevice</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -155,7 +155,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -172,7 +172,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -209,7 +209,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/enums/argtypecode.html b/docs/reference/api/typedoc/enums/argtypecode.html
index c9823da94..61be13dab 100644
--- a/docs/reference/api/typedoc/enums/argtypecode.html
+++ b/docs/reference/api/typedoc/enums/argtypecode.html
@@ -106,7 +106,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLDevice<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 6</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -116,7 +116,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -126,7 +126,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -136,7 +136,7 @@
 					<div class="tsd-signature tsd-kind-icon">Null<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 12</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMDLTensor<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 7</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMModule<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 9</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMNDArray<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 13</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -196,7 +196,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObject<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -206,7 +206,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObjectRValue<wbr>Ref<wbr>Arg<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 14</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -216,7 +216,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMOpaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -226,7 +226,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMPacked<wbr>Func<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 10</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -236,7 +236,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 11</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -246,7 +246,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/aynccallbackcode.html b/docs/reference/api/typedoc/enums/aynccallbackcode.html
index a5f6b0c39..7bcae90cf 100644
--- a/docs/reference/api/typedoc/enums/aynccallbackcode.html
+++ b/docs/reference/api/typedoc/enums/aynccallbackcode.html
@@ -93,7 +93,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Exception<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L676">runtime.ts:676</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L676">runtime.ts:676</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -103,7 +103,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L675">runtime.ts:675</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L675">runtime.ts:675</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/dldatatypecode.html b/docs/reference/api/typedoc/enums/dldatatypecode.html
index 47896ef2c..4dbbd597e 100644
--- a/docs/reference/api/typedoc/enums/dldatatypecode.html
+++ b/docs/reference/api/typedoc/enums/dldatatypecode.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L242">runtime.ts:242</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L242">runtime.ts:242</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L240">runtime.ts:240</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L240">runtime.ts:240</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">Opaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L243">runtime.ts:243</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L243">runtime.ts:243</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -125,7 +125,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L241">runtime.ts:241</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L241">runtime.ts:241</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/rpcserverstate.html b/docs/reference/api/typedoc/enums/rpcserverstate.html
index d0c8159ff..956cf0a04 100644
--- a/docs/reference/api/typedoc/enums/rpcserverstate.html
+++ b/docs/reference/api/typedoc/enums/rpcserverstate.html
@@ -90,7 +90,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<wbr>Key<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Server<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Body<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">Wait<wbr>For<wbr>Callback<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/enums/sizeof.html b/docs/reference/api/typedoc/enums/sizeof.html
index 1e637f733..377ccd270 100644
--- a/docs/reference/api/typedoc/enums/sizeof.html
+++ b/docs/reference/api/typedoc/enums/sizeof.html
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLDevice<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32 + I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">F32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">F64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">I32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -150,7 +150,7 @@
 					<div class="tsd-signature tsd-kind-icon">I64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -160,7 +160,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMValue<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">U16<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -180,7 +180,7 @@
 					<div class="tsd-signature tsd-kind-icon">U8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/index.html b/docs/reference/api/typedoc/index.html
index ba50f247d..020a27569 100644
--- a/docs/reference/api/typedoc/index.html
+++ b/docs/reference/api/typedoc/index.html
@@ -174,7 +174,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Alloc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>shape<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, ndim<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeCode<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeBits<span class="tsd [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>Bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">num [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -282,7 +282,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>To<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>from<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, to<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-sig [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -326,7 +326,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>ToBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</sp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -370,7 +370,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -406,7 +406,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMBackend<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number< [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -458,7 +458,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCFunc<wbr>Set<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ret<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -506,7 +506,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCb<wbr>Arg<wbr>ToReturn<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, code<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span c [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -545,7 +545,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Call<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-t [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -601,7 +601,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -637,7 +637,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Get<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span cla [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -676,7 +676,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>List<wbr>Global<wbr>Names<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>outSize<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, outArray<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -715,7 +715,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Register<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, f<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, override<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</spa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -758,7 +758,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMGet<wbr>Last<wbr>Error<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -788,7 +788,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -824,7 +824,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Get<wbr>Function<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, funcName<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, queryImports<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">numbe [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -872,7 +872,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Import<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, dep<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-si [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -912,7 +912,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMSynchronize<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>deviceType<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, deviceId<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signatur [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -954,7 +954,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Alloc<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>size<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -990,7 +990,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Free<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ptr<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1026,7 +1026,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Func<wbr>Create<wbr>FromCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resource<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1066,7 +1066,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>args<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1118,7 +1118,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<wbr>Finalizer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resourceHandle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1154,7 +1154,7 @@
 					<div class="tsd-signature tsd-kind-icon">GPUPointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1169,7 +1169,7 @@
 					<div class="tsd-signature tsd-kind-icon">Packed<wbr>Func<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">...</span>args<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol"> &amp; </span><a href="interfaces/disp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L36">runtime.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L36">runtime.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1184,7 +1184,7 @@
 					<div class="tsd-signature tsd-kind-icon">Pointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1199,7 +1199,7 @@
 					<div class="tsd-signature tsd-kind-icon">Ptr<wbr>Offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1217,7 +1217,7 @@
 					<div class="tsd-signature tsd-kind-icon">RPC_<wbr>MAGIC<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">1045105</span><span class="tsd-signature-symbol"> = 1045105</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1239,7 +1239,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/support.ts#L25">support.ts:25</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/support.ts#L25">support.ts:25</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1271,7 +1271,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/support.ts#L39">support.ts:39</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/support.ts#L39">support.ts:39</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1300,7 +1300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/support.ts#L52">support.ts:52</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/support.ts#L52">support.ts:52</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1337,7 +1337,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/compact.ts#L38">compact.ts:38</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/compact.ts#L38">compact.ts:38</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1368,7 +1368,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1390,7 +1390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/environment.ts#L32">environment.ts:32</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/environment.ts#L32">environment.ts:32</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1421,7 +1421,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/compact.ts#L24">compact.ts:24</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/compact.ts#L24">compact.ts:24</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1443,7 +1443,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L1362">runtime.ts:1362</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L1362">runtime.ts:1362</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1508,7 +1508,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/support.ts#L62">support.ts:62</a></li>
+									<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/support.ts#L62">support.ts:62</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1530,7 +1530,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<wbr>Code<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L246">runtime.ts:246</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L246">runtime.ts:246</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1539,7 +1539,7 @@
 						<div class="tsd-signature tsd-kind-icon">0<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;int&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L247">runtime.ts:247</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L247">runtime.ts:247</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1549,7 +1549,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;uint&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L248">runtime.ts:248</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L248">runtime.ts:248</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1559,7 +1559,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;float&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L249">runtime.ts:249</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L249">runtime.ts:249</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1569,7 +1569,7 @@
 						<div class="tsd-signature tsd-kind-icon">3<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;handle&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L250">runtime.ts:250</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L250">runtime.ts:250</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1580,7 +1580,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Enum<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L175">runtime.ts:175</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L175">runtime.ts:175</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1589,7 +1589,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L176">runtime.ts:176</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L176">runtime.ts:176</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1599,7 +1599,7 @@
 						<div class="tsd-signature tsd-kind-icon">15<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;webgpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L180">runtime.ts:180</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L180">runtime.ts:180</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1609,7 +1609,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cuda&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L177">runtime.ts:177</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L177">runtime.ts:177</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1619,7 +1619,7 @@
 						<div class="tsd-signature tsd-kind-icon">4<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;opencl&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L178">runtime.ts:178</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L178">runtime.ts:178</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1629,7 +1629,7 @@
 						<div class="tsd-signature tsd-kind-icon">8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;metal&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L179">runtime.ts:179</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L179">runtime.ts:179</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1640,7 +1640,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Str<wbr>ToEnum<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L183">runtime.ts:183</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L183">runtime.ts:183</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1649,7 +1649,7 @@
 						<div class="tsd-signature tsd-kind-icon">cl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L186">runtime.ts:186</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L186">runtime.ts:186</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1659,7 +1659,7 @@
 						<div class="tsd-signature tsd-kind-icon">cpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 1</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L184">runtime.ts:184</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L184">runtime.ts:184</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1669,7 +1669,7 @@
 						<div class="tsd-signature tsd-kind-icon">cuda<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 2</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L185">runtime.ts:185</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L185">runtime.ts:185</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1679,7 +1679,7 @@
 						<div class="tsd-signature tsd-kind-icon">metal<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 8</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L189">runtime.ts:189</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L189">runtime.ts:189</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1689,7 +1689,7 @@
 						<div class="tsd-signature tsd-kind-icon">opencl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L187">runtime.ts:187</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L187">runtime.ts:187</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1699,7 +1699,7 @@
 						<div class="tsd-signature tsd-kind-icon">vulkan<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 7</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L188">runtime.ts:188</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L188">runtime.ts:188</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1709,7 +1709,7 @@
 						<div class="tsd-signature tsd-kind-icon">webgpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 15</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/runtime.ts#L190">runtime.ts:190</a></li>
+								<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/runtime.ts#L190">runtime.ts:190</a></li>
 							</ul>
 						</aside>
 					</section>
diff --git a/docs/reference/api/typedoc/interfaces/disposable.html b/docs/reference/api/typedoc/interfaces/disposable.html
index 2f730d54b..2397bb511 100644
--- a/docs/reference/api/typedoc/interfaces/disposable.html
+++ b/docs/reference/api/typedoc/interfaces/disposable.html
@@ -113,7 +113,7 @@
 					<div class="tsd-signature tsd-kind-icon">dispose<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/types.ts#L52">types.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/types.ts#L52">types.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/reference/api/typedoc/interfaces/functioninfo.html b/docs/reference/api/typedoc/interfaces/functioninfo.html
index a84a2d377..00f802786 100644
--- a/docs/reference/api/typedoc/interfaces/functioninfo.html
+++ b/docs/reference/api/typedoc/interfaces/functioninfo.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">arg_<wbr>types<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">launch_<wbr>param_<wbr>tags<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">name<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/reference/api/typedoc/interfaces/libraryprovider.html b/docs/reference/api/typedoc/interfaces/libraryprovider.html
index 74bf7ed0e..6b2e9693f 100644
--- a/docs/reference/api/typedoc/interfaces/libraryprovider.html
+++ b/docs/reference/api/typedoc/interfaces/libraryprovider.html
@@ -112,7 +112,7 @@
 					<div class="tsd-signature tsd-kind-icon">imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/types.ts#L34">types.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/types.ts#L34">types.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -127,7 +127,7 @@
 					<div class="tsd-signature tsd-kind-icon">start<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>inst<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">Instance</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/tvm/blob/3c737fbd5/web/src/types.ts#L39">types.ts:39</a></li>
+							<li>Defined in <a href="https://github.com/apache/tvm/blob/ebbce649f/web/src/types.ts#L39">types.ts:39</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/searchindex.js b/docs/searchindex.js
index 8f63f80ba..6c70993e1 100644
--- a/docs/searchindex.js
+++ b/docs/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["arch/benchmark","arch/convert_layout","arch/debugger","arch/device_target_interactions","arch/frontend/tensorflow","arch/hybrid_script","arch/index","arch/inferbound","arch/introduction_to_module_serialization","arch/microtvm_design","arch/microtvm_project_api","arch/model_library_format","arch/pass_infra","arch/relay_intro","arch/relay_op_strategy","arch/runtime","arch/runtimes/vulkan","arch/security","arch/virtual_machine","contribute/ci","contribute/code_gu [...]
\ No newline at end of file
+Search.setIndex({docnames:["arch/benchmark","arch/convert_layout","arch/debugger","arch/device_target_interactions","arch/frontend/tensorflow","arch/hybrid_script","arch/index","arch/inferbound","arch/introduction_to_module_serialization","arch/microtvm_design","arch/microtvm_project_api","arch/model_library_format","arch/pass_infra","arch/relay_intro","arch/relay_op_strategy","arch/runtime","arch/runtimes/vulkan","arch/security","arch/virtual_machine","contribute/ci","contribute/code_gu [...]
\ No newline at end of file
diff --git a/docs/topic/vta/tutorials/autotvm/sg_execution_times.html b/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
index e22499f1f..5ba2d1038 100644
--- a/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/autotvm/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:20.586</strong> total execution time for <strong>topic_vta_tutorials_autotvm</strong> files:</p>
+<p><strong>00:21.805</strong> total execution time for <strong>topic_vta_tutorials_autotvm</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 82%" />
@@ -336,7 +336,7 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="tune_relay_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-relay-vta-py"><span class="std std-ref">Auto-tuning a convolutional network on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_vta.py</span></code>)</p></td>
-<td><p>00:20.579</p></td>
+<td><p>00:21.799</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="tune_alu_vta.html#sphx-glr-topic-vta-tutorials-autotvm-tune-alu-vta-py"><span class="std std-ref">Auto-tuning a ALU fused op on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_alu_vta.py</span></code>)</p></td>
diff --git a/docs/topic/vta/tutorials/frontend/deploy_classification.html b/docs/topic/vta/tutorials/frontend/deploy_classification.html
index 954c69f4e..9dd9c5780 100644
--- a/docs/topic/vta/tutorials/frontend/deploy_classification.html
+++ b/docs/topic/vta/tutorials/frontend/deploy_classification.html
@@ -571,7 +571,7 @@ and dense layer which will both be executed in fp32 on the CPU.</p></li>
   DeprecationWarning,
 /workspace/vta/tutorials/frontend/deploy_classification.py:213: DeprecationWarning: legacy graph executor behavior of producing json / lib / params will be removed in the next release. Please see documents of tvm.contrib.graph_executor.GraphModule for the  new recommended usage.
   relay_prog, target=tvm.target.Target(target, host=env.target_host), params=params
-resnet18_v1 inference graph built in 22.09s!
+resnet18_v1 inference graph built in 23.75s!
 </pre></div>
 </div>
 </div>
diff --git a/docs/topic/vta/tutorials/frontend/deploy_detection.html b/docs/topic/vta/tutorials/frontend/deploy_detection.html
index 447013d93..8340b1000 100644
--- a/docs/topic/vta/tutorials/frontend/deploy_detection.html
+++ b/docs/topic/vta/tutorials/frontend/deploy_detection.html
@@ -589,7 +589,7 @@ and dense layer which will both be executed in fp32 on the CPU.</p></li>
   &quot;target_host parameter is going to be deprecated. &quot;
 /workspace/python/tvm/relay/build_module.py:411: DeprecationWarning: Please use input parameter mod (tvm.IRModule) instead of deprecated parameter mod (tvm.relay.function.Function)
   DeprecationWarning,
-yolov3-tiny inference graph built in 15.66s!
+yolov3-tiny inference graph built in 16.48s!
 </pre></div>
 </div>
 </div>
diff --git a/docs/topic/vta/tutorials/frontend/sg_execution_times.html b/docs/topic/vta/tutorials/frontend/sg_execution_times.html
index 161d46af7..d687b35ec 100644
--- a/docs/topic/vta/tutorials/frontend/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/frontend/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-frontend-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>01:30.205</strong> total execution time for <strong>topic_vta_tutorials_frontend</strong> files:</p>
+<p><strong>01:33.558</strong> total execution time for <strong>topic_vta_tutorials_frontend</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -336,11 +336,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="deploy_detection.html#sphx-glr-topic-vta-tutorials-frontend-deploy-detection-py"><span class="std std-ref">Deploy Pretrained Vision Detection Model from Darknet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_detection.py</span></code>)</p></td>
-<td><p>00:48.071</p></td>
+<td><p>00:49.428</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="deploy_classification.html#sphx-glr-topic-vta-tutorials-frontend-deploy-classification-py"><span class="std std-ref">Deploy Pretrained Vision Model from MxNet on VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_classification.py</span></code>)</p></td>
-<td><p>00:42.134</p></td>
+<td><p>00:44.131</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/topic/vta/tutorials/optimize/sg_execution_times.html b/docs/topic/vta/tutorials/optimize/sg_execution_times.html
index 50787f393..6459f861a 100644
--- a/docs/topic/vta/tutorials/optimize/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/optimize/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-optimize-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:03.132</strong> total execution time for <strong>topic_vta_tutorials_optimize</strong> files:</p>
+<p><strong>00:03.293</strong> total execution time for <strong>topic_vta_tutorials_optimize</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 84%" />
@@ -336,11 +336,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="convolution_opt.html#sphx-glr-topic-vta-tutorials-optimize-convolution-opt-py"><span class="std std-ref">2D Convolution Optimization</span></a> (<code class="docutils literal notranslate"><span class="pre">convolution_opt.py</span></code>)</p></td>
-<td><p>00:02.781</p></td>
+<td><p>00:02.864</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="matrix_multiply_opt.html#sphx-glr-topic-vta-tutorials-optimize-matrix-multiply-opt-py"><span class="std std-ref">Matrix Multiply Blocking</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply_opt.py</span></code>)</p></td>
-<td><p>00:00.351</p></td>
+<td><p>00:00.429</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/topic/vta/tutorials/sg_execution_times.html b/docs/topic/vta/tutorials/sg_execution_times.html
index c2f013e25..43d001bef 100644
--- a/docs/topic/vta/tutorials/sg_execution_times.html
+++ b/docs/topic/vta/tutorials/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-topic-vta-tutorials-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:00.622</strong> total execution time for <strong>topic_vta_tutorials</strong> files:</p>
+<p><strong>00:00.781</strong> total execution time for <strong>topic_vta_tutorials</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 81%" />
@@ -336,11 +336,11 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="matrix_multiply.html#sphx-glr-topic-vta-tutorials-matrix-multiply-py"><span class="std std-ref">Simple Matrix Multiply</span></a> (<code class="docutils literal notranslate"><span class="pre">matrix_multiply.py</span></code>)</p></td>
-<td><p>00:00.332</p></td>
+<td><p>00:00.420</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="vta_get_started.html#sphx-glr-topic-vta-tutorials-vta-get-started-py"><span class="std std-ref">Get Started with VTA</span></a> (<code class="docutils literal notranslate"><span class="pre">vta_get_started.py</span></code>)</p></td>
-<td><p>00:00.289</p></td>
+<td><p>00:00.361</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 </tbody>
diff --git a/docs/tutorial/auto_scheduler_matmul_x86.html b/docs/tutorial/auto_scheduler_matmul_x86.html
index 181cf1542..317f0436b 100644
--- a/docs/tutorial/auto_scheduler_matmul_x86.html
+++ b/docs/tutorial/auto_scheduler_matmul_x86.html
@@ -566,7 +566,7 @@ operator fusion.</p>
 <span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 93.273 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Execution time of this operator: 95.801 ms
 </pre></div>
 </div>
 </div>
@@ -630,7 +630,6 @@ resume the status and do more 5 trials.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Resume search:
 /usr/local/lib/python3.7/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated.  See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
   warnings.warn(f&#39;Old style callback is deprecated.  See: {link}&#39;, UserWarning)
-*E*E
 </pre></div>
 </div>
 </div>
@@ -641,7 +640,7 @@ automatically optimize a matrix multiplication, without the need to specify a
 search template.  It ends a series of examples that starts from the Tensor
 Expression (TE) language that demonstrates how TVM can optimize computational
 operations.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  19.752 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  2.489 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-auto-scheduler-matmul-x86-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">auto_scheduler_matmul_x86.py</span></code></a></p>
diff --git a/docs/tutorial/autotvm_matmul_x86.html b/docs/tutorial/autotvm_matmul_x86.html
index a1c601c44..9b468fbb4 100644
--- a/docs/tutorial/autotvm_matmul_x86.html
+++ b/docs/tutorial/autotvm_matmul_x86.html
@@ -668,16 +668,16 @@ reduce variance, we take 5 measurements and average them.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>waiting for device...
 device available
 Get devices for measurement successfully!
-No: 1   GFLOPS: 9.94/9.94       result: MeasureResult(costs=(0.0269965032,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5648865699768066, timestamp=1659035020.8669958)       [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 256])],None,80
-No: 2   GFLOPS: 2.82/9.94       result: MeasureResult(costs=(0.0952838752,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6697702407836914, timestamp=1659035022.554008)        [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 8])],None,32
-No: 3   GFLOPS: 11.85/11.85     result: MeasureResult(costs=(0.022660449,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5570068359375, timestamp=1659035023.6011431)   [(&#39;tile_y&#39;, [-1, 64]), (&#39;tile_x&#39;, [-1, 32])],None,56
-No: 4   GFLOPS: 1.65/11.85      result: MeasureResult(costs=(0.16307870060000001,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.727090358734131, timestamp=1659035026.889413)  [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 4])],None,20
-No: 5   GFLOPS: 3.66/11.85      result: MeasureResult(costs=(0.0732470756,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.3117024898529053, timestamp=1659035028.3205142)       [(&#39;tile_y&#39;, [-1, 256]), (&#39;tile_x&#39;, [-1, 16])],None,48
-No: 6   GFLOPS: 1.76/11.85      result: MeasureResult(costs=(0.152845861,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.5684256553649902, timestamp=1659035031.4550264)        [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 4])],None,29
-No: 7   GFLOPS: 0.86/11.85      result: MeasureResult(costs=(0.3130899754,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.129064559936523, timestamp=1659035036.6268103)        [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 2])],None,19
-No: 8   GFLOPS: 10.57/11.85     result: MeasureResult(costs=(0.025399391799999997,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5462801456451416, timestamp=1659035037.19481) [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 64])],None,62
-No: 9   GFLOPS: 1.62/11.85      result: MeasureResult(costs=(0.16565159599999998,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.748063087463379, timestamp=1659035040.0614984) [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 2])],None,11
-No: 10  GFLOPS: 2.35/11.85      result: MeasureResult(costs=(0.114151147,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.931006669998169, timestamp=1659035042.0504346) [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 4])],None,22
+No: 1   GFLOPS: 10.24/10.24     result: MeasureResult(costs=(0.0262166112,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5573618412017822, timestamp=1659037111.9611266)       [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 256])],None,80
+No: 2   GFLOPS: 2.96/10.24      result: MeasureResult(costs=(0.0906545668,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6005101203918457, timestamp=1659037114.1158283)       [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 8])],None,32
+No: 3   GFLOPS: 11.81/11.81     result: MeasureResult(costs=(0.022734837200000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5972039699554443, timestamp=1659037114.6895661)       [(&#39;tile_y&#39;, [-1, 64]), (&#39;tile_x&#39;, [-1, 32])],None,56
+No: 4   GFLOPS: 1.86/11.81      result: MeasureResult(costs=(0.1439960192,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.4247968196868896, timestamp=1659037117.700718)        [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 4])],None,20
+No: 5   GFLOPS: 3.65/11.81      result: MeasureResult(costs=(0.0736059658,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.31510591506958, timestamp=1659037119.148282)  [(&#39;tile_y&#39;, [-1, 256]), (&#39;tile_x&#39;, [-1, 16])],None,48
+No: 6   GFLOPS: 1.82/11.81      result: MeasureResult(costs=(0.1476827578,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.489528179168701, timestamp=1659037122.2242532)        [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 4])],None,29
+No: 7   GFLOPS: 0.87/11.81      result: MeasureResult(costs=(0.30776385300000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=5.049600601196289, timestamp=1659037127.315)     [(&#39;tile_y&#39;, [-1, 512]), (&#39;tile_x&#39;, [-1, 2])],None,19
+No: 8   GFLOPS: 10.74/11.81     result: MeasureResult(costs=(0.0250025322,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.5460166931152344, timestamp=1659037127.8792002)       [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 64])],None,62
+No: 9   GFLOPS: 1.91/11.81      result: MeasureResult(costs=(0.1407600762,), error_no=MeasureErrorNo.NO_ERROR, all_cost=2.353973627090454, timestamp=1659037130.3529673)        [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 2])],None,11
+No: 10  GFLOPS: 2.80/11.81      result: MeasureResult(costs=(0.095986646,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6423468589782715, timestamp=1659037132.0528631)        [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 4])],None,22
 </pre></div>
 </div>
 <p>With tuning completed, we can choose the configuration from the log file that
diff --git a/docs/tutorial/autotvm_relay_x86.html b/docs/tutorial/autotvm_relay_x86.html
index dbc81cd8b..b9c04979f 100644
--- a/docs/tutorial/autotvm_relay_x86.html
+++ b/docs/tutorial/autotvm_relay_x86.html
@@ -550,7 +550,7 @@ standard deviation.</p>
 <span class="nb">print</span><span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">unoptimized</span></a><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;mean&#39;: 490.9751048400267, &#39;median&#39;: 490.97516415004065, &#39;std&#39;: 0.751160590199003}
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>{&#39;mean&#39;: 499.76453579998633, &#39;median&#39;: 499.21146484994097, &#39;std&#39;: 1.7238883331097248}
 </pre></div>
 </div>
 </div>
@@ -705,178 +705,178 @@ depending on the specifics of the model and the target platform.</p>
   &quot;target_host parameter is going to be deprecated. &quot;
 
 [Task  1/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  1/25]  Current/Best:   17.28/  17.28 GFLOPS | Progress: (4/20) | 5.76 s
-[Task  1/25]  Current/Best:    6.15/  17.28 GFLOPS | Progress: (8/20) | 9.23 s
-[Task  1/25]  Current/Best:   11.57/  22.70 GFLOPS | Progress: (12/20) | 11.71 s
-[Task  1/25]  Current/Best:   16.76/  22.84 GFLOPS | Progress: (16/20) | 13.39 s
-[Task  1/25]  Current/Best:   11.64/  23.88 GFLOPS | Progress: (20/20) | 15.13 s Done.
+[Task  1/25]  Current/Best:   17.35/  17.35 GFLOPS | Progress: (4/20) | 6.35 s
+[Task  1/25]  Current/Best:    6.15/  17.35 GFLOPS | Progress: (8/20) | 9.39 s
+[Task  1/25]  Current/Best:   11.53/  22.65 GFLOPS | Progress: (12/20) | 11.86 s
+[Task  1/25]  Current/Best:   16.68/  22.65 GFLOPS | Progress: (16/20) | 13.55 s
+[Task  1/25]  Current/Best:   11.57/  23.88 GFLOPS | Progress: (20/20) | 15.29 s Done.
 
 [Task  2/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  2/25]  Current/Best:   12.25/  13.33 GFLOPS | Progress: (4/20) | 3.84 s
-[Task  2/25]  Current/Best:   14.17/  18.67 GFLOPS | Progress: (8/20) | 5.16 s
-[Task  2/25]  Current/Best:   21.17/  21.17 GFLOPS | Progress: (12/20) | 6.47 s
-[Task  2/25]  Current/Best:   12.27/  21.17 GFLOPS | Progress: (16/20) | 7.73 s
-[Task  2/25]  Current/Best:   19.42/  21.17 GFLOPS | Progress: (20/20) | 9.34 s Done.
+[Task  2/25]  Current/Best:   12.15/  13.07 GFLOPS | Progress: (4/20) | 3.94 s
+[Task  2/25]  Current/Best:   13.98/  18.52 GFLOPS | Progress: (8/20) | 5.26 s
+[Task  2/25]  Current/Best:   21.00/  21.00 GFLOPS | Progress: (12/20) | 6.59 s
+[Task  2/25]  Current/Best:   12.16/  21.00 GFLOPS | Progress: (16/20) | 7.89 s
+[Task  2/25]  Current/Best:   19.92/  21.00 GFLOPS | Progress: (20/20) | 9.51 s Done.
 
 [Task  3/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  3/25]  Current/Best:    1.63/  10.54 GFLOPS | Progress: (4/20) | 5.86 s
-[Task  3/25]  Current/Best:   15.60/  16.92 GFLOPS | Progress: (8/20) | 7.79 s
-[Task  3/25]  Current/Best:   14.90/  16.92 GFLOPS | Progress: (12/20) | 9.51 s
-[Task  3/25]  Current/Best:    7.20/  23.84 GFLOPS | Progress: (16/20) | 11.42 s
-[Task  3/25]  Current/Best:   12.69/  23.84 GFLOPS | Progress: (20/20) | 15.93 s Done.
+[Task  3/25]  Current/Best:    1.63/  10.53 GFLOPS | Progress: (4/20) | 5.90 s
+[Task  3/25]  Current/Best:   15.27/  16.84 GFLOPS | Progress: (8/20) | 7.83 s
+[Task  3/25]  Current/Best:   14.87/  16.84 GFLOPS | Progress: (12/20) | 9.56 s
+[Task  3/25]  Current/Best:    7.22/  23.69 GFLOPS | Progress: (16/20) | 11.48 s
+[Task  3/25]  Current/Best:   12.55/  23.69 GFLOPS | Progress: (20/20) | 16.05 s Done.
 
 [Task  4/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  4/25]  Current/Best:    9.46/  20.33 GFLOPS | Progress: (4/20) | 2.37 s
-[Task  4/25]  Current/Best:    6.86/  20.33 GFLOPS | Progress: (8/20) | 6.97 s
-[Task  4/25]  Current/Best:   22.37/  22.37 GFLOPS | Progress: (12/20) | 11.84 s
-[Task  4/25]  Current/Best:   17.39/  22.37 GFLOPS | Progress: (16/20) | 14.19 s
-[Task  4/25]  Current/Best:   13.41/  22.37 GFLOPS | Progress: (20/20) | 16.22 s Done.
+[Task  4/25]  Current/Best:    9.57/  20.46 GFLOPS | Progress: (4/20) | 2.42 s
+[Task  4/25]  Current/Best:    6.78/  20.46 GFLOPS | Progress: (8/20) | 7.17 s
+[Task  4/25]  Current/Best:   21.64/  21.64 GFLOPS | Progress: (12/20) | 12.06 s
+[Task  4/25]  Current/Best:   17.09/  21.64 GFLOPS | Progress: (16/20) | 14.48 s
+[Task  4/25]  Current/Best:   13.20/  21.64 GFLOPS | Progress: (20/20) | 16.47 s Done.
 
 [Task  5/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  5/25]  Current/Best:    9.74/  10.45 GFLOPS | Progress: (4/20) | 2.58 s
-[Task  5/25]  Current/Best:   11.83/  12.80 GFLOPS | Progress: (8/20) | 4.62 s
-[Task  5/25]  Current/Best:   11.86/  18.15 GFLOPS | Progress: (12/20) | 7.79 s
-[Task  5/25]  Current/Best:   11.87/  22.73 GFLOPS | Progress: (16/20) | 9.23 s
-[Task  5/25]  Current/Best:   12.11/  22.73 GFLOPS | Progress: (20/20) | 11.16 s Done.
+[Task  5/25]  Current/Best:    9.75/  10.44 GFLOPS | Progress: (4/20) | 2.62 s
+[Task  5/25]  Current/Best:   11.75/  13.05 GFLOPS | Progress: (8/20) | 4.67 s
+[Task  5/25]  Current/Best:    9.61/  17.86 GFLOPS | Progress: (12/20) | 7.86 s
+[Task  5/25]  Current/Best:   11.85/  22.67 GFLOPS | Progress: (16/20) | 9.29 s
+[Task  5/25]  Current/Best:   11.94/  22.67 GFLOPS | Progress: (20/20) | 11.21 s Done.
 
 [Task  6/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  6/25]  Current/Best:   12.16/  20.69 GFLOPS | Progress: (4/20) | 4.07 s
-[Task  6/25]  Current/Best:   19.08/  20.69 GFLOPS | Progress: (8/20) | 5.83 s
-[Task  6/25]  Current/Best:   13.36/  20.69 GFLOPS | Progress: (12/20) | 7.75 s
-[Task  6/25]  Current/Best:   19.94/  20.69 GFLOPS | Progress: (16/20) | 9.96 s
-[Task  6/25]  Current/Best:    3.73/  20.69 GFLOPS | Progress: (20/20) | 12.48 s Done.
+[Task  6/25]  Current/Best:   12.15/  20.67 GFLOPS | Progress: (4/20) | 4.18 s
+[Task  6/25]  Current/Best:   18.81/  20.67 GFLOPS | Progress: (8/20) | 5.92 s
+[Task  6/25]  Current/Best:   13.26/  20.67 GFLOPS | Progress: (12/20) | 7.86 s
+[Task  6/25]  Current/Best:   19.87/  20.67 GFLOPS | Progress: (16/20) | 10.10 s
+[Task  6/25]  Current/Best:    3.75/  20.67 GFLOPS | Progress: (20/20) | 12.61 s Done.
 
 [Task  7/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  7/25]  Current/Best:   11.29/  12.21 GFLOPS | Progress: (4/20) | 3.60 s
-[Task  7/25]  Current/Best:   20.33/  21.16 GFLOPS | Progress: (8/20) | 5.10 s
-[Task  7/25]  Current/Best:   16.11/  21.16 GFLOPS | Progress: (12/20) | 7.01 s
-[Task  7/25]  Current/Best:   12.30/  21.16 GFLOPS | Progress: (16/20) | 9.03 s
-[Task  7/25]  Current/Best:    6.44/  21.82 GFLOPS | Progress: (20/20) | 11.47 s Done.
+[Task  7/25]  Current/Best:   10.57/  12.30 GFLOPS | Progress: (4/20) | 3.62 s
+[Task  7/25]  Current/Best:   20.10/  21.20 GFLOPS | Progress: (8/20) | 5.16 s
+[Task  7/25]  Current/Best:   16.06/  21.20 GFLOPS | Progress: (12/20) | 7.12 s
+[Task  7/25]  Current/Best:   12.24/  21.20 GFLOPS | Progress: (16/20) | 9.18 s
+[Task  7/25]  Current/Best:    6.27/  21.69 GFLOPS | Progress: (20/20) | 11.68 s Done.
 
 [Task  8/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  8/25]  Current/Best:    9.80/  13.91 GFLOPS | Progress: (4/20) | 2.93 s
-[Task  8/25]  Current/Best:    9.47/  13.91 GFLOPS | Progress: (8/20) | 8.01 s
-[Task  8/25]  Current/Best:   12.43/  13.91 GFLOPS | Progress: (12/20) | 14.33 s
-[Task  8/25]  Current/Best:   18.82/  18.82 GFLOPS | Progress: (16/20) | 16.40 s
-[Task  8/25]  Current/Best:   20.07/  20.07 GFLOPS | Progress: (20/20) | 23.33 s Done.
+[Task  8/25]  Current/Best:   10.44/  14.57 GFLOPS | Progress: (4/20) | 2.95 s
+[Task  8/25]  Current/Best:    9.91/  14.57 GFLOPS | Progress: (8/20) | 8.03 s
+[Task  8/25]  Current/Best:   12.86/  14.57 GFLOPS | Progress: (12/20) | 14.61 s
+[Task  8/25]  Current/Best:   18.99/  18.99 GFLOPS | Progress: (16/20) | 16.69 s
+[Task  8/25]  Current/Best:   20.19/  20.19 GFLOPS | Progress: (20/20) | 23.74 s Done.
 
 [Task  9/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task  9/25]  Current/Best:   14.37/  15.76 GFLOPS | Progress: (4/20) | 11.93 s
-[Task  9/25]  Current/Best:   23.38/  23.38 GFLOPS | Progress: (8/20) | 13.65 s
-[Task  9/25]  Current/Best:    8.27/  23.38 GFLOPS | Progress: (12/20) | 16.16 s
-[Task  9/25]  Current/Best:   17.96/  23.38 GFLOPS | Progress: (16/20) | 18.99 s
-[Task  9/25]  Current/Best:    9.27/  23.38 GFLOPS | Progress: (20/20) | 27.44 s
+[Task  9/25]  Current/Best:   14.28/  14.63 GFLOPS | Progress: (4/20) | 12.01 s
+[Task  9/25]  Current/Best:   23.28/  23.28 GFLOPS | Progress: (8/20) | 13.82 s
+[Task  9/25]  Current/Best:    8.17/  23.28 GFLOPS | Progress: (12/20) | 16.33 s
+[Task  9/25]  Current/Best:   17.84/  23.28 GFLOPS | Progress: (16/20) | 19.20 s
+[Task  9/25]  Current/Best:    9.07/  23.28 GFLOPS | Progress: (20/20) | 27.81 s
 [Task 10/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 10/25]  Current/Best:   18.34/  18.34 GFLOPS | Progress: (4/20) | 2.56 s
-[Task 10/25]  Current/Best:   15.44/  18.34 GFLOPS | Progress: (8/20) | 4.17 s
-[Task 10/25]  Current/Best:   12.75/  18.95 GFLOPS | Progress: (12/20) | 5.69 s
-[Task 10/25]  Current/Best:   19.11/  20.25 GFLOPS | Progress: (16/20) | 6.79 s
-[Task 10/25]  Current/Best:    8.83/  20.25 GFLOPS | Progress: (20/20) | 8.30 s Done.
+[Task 10/25]  Current/Best:   18.25/  18.25 GFLOPS | Progress: (4/20) | 2.61 s
+[Task 10/25]  Current/Best:   15.56/  18.25 GFLOPS | Progress: (8/20) | 4.24 s
+[Task 10/25]  Current/Best:   12.67/  19.01 GFLOPS | Progress: (12/20) | 5.78 s
+[Task 10/25]  Current/Best:   19.04/  20.36 GFLOPS | Progress: (16/20) | 6.89 s
+[Task 10/25]  Current/Best:    8.82/  20.36 GFLOPS | Progress: (20/20) | 8.42 s Done.
 
 [Task 11/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 11/25]  Current/Best:   12.32/  18.10 GFLOPS | Progress: (4/20) | 3.28 s
-[Task 11/25]  Current/Best:   16.89/  18.10 GFLOPS | Progress: (8/20) | 6.05 s
-[Task 11/25]  Current/Best:   18.29/  18.29 GFLOPS | Progress: (12/20) | 8.07 s
-[Task 11/25]  Current/Best:   13.49/  21.24 GFLOPS | Progress: (16/20) | 10.99 s
-[Task 11/25]  Current/Best:   19.44/  21.58 GFLOPS | Progress: (20/20) | 13.06 s Done.
+[Task 11/25]  Current/Best:   12.28/  18.11 GFLOPS | Progress: (4/20) | 3.37 s
+[Task 11/25]  Current/Best:   16.85/  18.11 GFLOPS | Progress: (8/20) | 6.17 s
+[Task 11/25]  Current/Best:   18.11/  18.11 GFLOPS | Progress: (12/20) | 8.25 s
+[Task 11/25]  Current/Best:   13.34/  21.15 GFLOPS | Progress: (16/20) | 11.14 s
+[Task 11/25]  Current/Best:   19.37/  21.57 GFLOPS | Progress: (20/20) | 13.24 s Done.
 
 [Task 12/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 12/25]  Current/Best:    7.84/  17.90 GFLOPS | Progress: (4/20) | 5.56 s
-[Task 12/25]  Current/Best:    5.27/  17.90 GFLOPS | Progress: (8/20) | 9.44 s
-[Task 12/25]  Current/Best:   18.91/  18.91 GFLOPS | Progress: (12/20) | 11.43 s
-[Task 12/25]  Current/Best:   15.52/  18.91 GFLOPS | Progress: (16/20) | 14.29 s
-[Task 12/25]  Current/Best:   15.11/  18.91 GFLOPS | Progress: (20/20) | 16.19 s Done.
+[Task 12/25]  Current/Best:    7.79/  18.09 GFLOPS | Progress: (4/20) | 5.78 s
+[Task 12/25]  Current/Best:    5.25/  18.09 GFLOPS | Progress: (8/20) | 9.70 s
+[Task 12/25]  Current/Best:   18.88/  18.93 GFLOPS | Progress: (12/20) | 11.70 s
+[Task 12/25]  Current/Best:   15.08/  18.93 GFLOPS | Progress: (16/20) | 14.61 s
+[Task 12/25]  Current/Best:   15.23/  19.24 GFLOPS | Progress: (20/20) | 16.53 s Done.
 
 [Task 13/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 13/25]  Current/Best:    8.73/  17.34 GFLOPS | Progress: (4/20) | 3.69 s
-[Task 13/25]  Current/Best:   15.89/  21.22 GFLOPS | Progress: (8/20) | 6.28 s
-[Task 13/25]  Current/Best:   19.69/  21.69 GFLOPS | Progress: (12/20) | 9.34 s
-[Task 13/25]  Current/Best:   12.31/  21.69 GFLOPS | Progress: (16/20) | 12.70 s
-[Task 13/25]  Current/Best:   18.64/  21.69 GFLOPS | Progress: (20/20) | 15.00 s Done.
+[Task 13/25]  Current/Best:    8.91/  17.35 GFLOPS | Progress: (4/20) | 3.77 s
+[Task 13/25]  Current/Best:   16.03/  20.84 GFLOPS | Progress: (8/20) | 6.37 s
+[Task 13/25]  Current/Best:   19.36/  21.60 GFLOPS | Progress: (12/20) | 9.48 s
+[Task 13/25]  Current/Best:   12.23/  21.60 GFLOPS | Progress: (16/20) | 12.99 s
+[Task 13/25]  Current/Best:   18.54/  21.60 GFLOPS | Progress: (20/20) | 15.33 s Done.
 
 [Task 14/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 14/25]  Current/Best:   13.63/  13.63 GFLOPS | Progress: (4/20) | 3.31 s
-[Task 14/25]  Current/Best:    6.11/  13.63 GFLOPS | Progress: (8/20) | 5.50 s
-[Task 14/25]  Current/Best:   20.66/  20.66 GFLOPS | Progress: (12/20) | 8.18 s
-[Task 14/25]  Current/Best:   17.04/  20.66 GFLOPS | Progress: (16/20) | 9.81 s Done.
+[Task 14/25]  Current/Best:   13.59/  13.59 GFLOPS | Progress: (4/20) | 3.47 s
+[Task 14/25]  Current/Best:    6.11/  13.59 GFLOPS | Progress: (8/20) | 5.65 s
+[Task 14/25]  Current/Best:   21.01/  21.01 GFLOPS | Progress: (12/20) | 8.31 s
+[Task 14/25]  Current/Best:   16.53/  21.01 GFLOPS | Progress: (16/20) | 9.98 s Done.
 
-[Task 14/25]  Current/Best:   17.07/  20.66 GFLOPS | Progress: (20/20) | 11.55 s
+[Task 14/25]  Current/Best:   17.14/  21.01 GFLOPS | Progress: (20/20) | 11.76 s
 [Task 15/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 15/25]  Current/Best:   16.18/  17.63 GFLOPS | Progress: (4/20) | 2.71 s
-[Task 15/25]  Current/Best:   13.28/  18.07 GFLOPS | Progress: (8/20) | 4.05 s
-[Task 15/25]  Current/Best:   10.26/  22.37 GFLOPS | Progress: (12/20) | 6.26 s
-[Task 15/25]  Current/Best:   20.45/  22.37 GFLOPS | Progress: (16/20) | 9.50 s
-[Task 15/25]  Current/Best:    9.66/  22.37 GFLOPS | Progress: (20/20) | 10.51 s
+[Task 15/25]  Current/Best:   16.11/  17.63 GFLOPS | Progress: (4/20) | 2.78 s
+[Task 15/25]  Current/Best:   14.32/  17.89 GFLOPS | Progress: (8/20) | 4.14 s
+[Task 15/25]  Current/Best:   10.34/  22.22 GFLOPS | Progress: (12/20) | 6.37 s
+[Task 15/25]  Current/Best:   20.37/  22.22 GFLOPS | Progress: (16/20) | 9.48 s
+[Task 15/25]  Current/Best:    9.63/  22.22 GFLOPS | Progress: (20/20) | 10.50 s
 [Task 16/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 16/25]  Current/Best:   19.13/  19.13 GFLOPS | Progress: (4/20) | 2.90 s
-[Task 16/25]  Current/Best:    3.04/  19.13 GFLOPS | Progress: (8/20) | 4.50 s
-[Task 16/25]  Current/Best:   19.53/  19.53 GFLOPS | Progress: (12/20) | 5.70 s
-[Task 16/25]  Current/Best:   17.96/  19.53 GFLOPS | Progress: (16/20) | 7.04 s
-[Task 16/25]  Current/Best:   10.04/  19.53 GFLOPS | Progress: (20/20) | 9.18 s Done.
+[Task 16/25]  Current/Best:   20.61/  20.61 GFLOPS | Progress: (4/20) | 3.06 s
+[Task 16/25]  Current/Best:    3.02/  20.61 GFLOPS | Progress: (8/20) | 4.69 s
+[Task 16/25]  Current/Best:   19.58/  20.61 GFLOPS | Progress: (12/20) | 5.90 s
+[Task 16/25]  Current/Best:   17.68/  20.61 GFLOPS | Progress: (16/20) | 7.28 s
+[Task 16/25]  Current/Best:   10.01/  20.61 GFLOPS | Progress: (20/20) | 9.45 s Done.
 
 [Task 17/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 17/25]  Current/Best:   13.74/  18.80 GFLOPS | Progress: (4/20) | 4.75 s
-[Task 17/25]  Current/Best:   12.69/  23.45 GFLOPS | Progress: (8/20) | 7.52 s
-[Task 17/25]  Current/Best:   16.88/  23.45 GFLOPS | Progress: (12/20) | 9.56 s
-[Task 17/25]  Current/Best:   16.53/  23.45 GFLOPS | Progress: (16/20) | 11.74 s
-[Task 17/25]  Current/Best:   10.06/  23.45 GFLOPS | Progress: (20/20) | 13.88 s Done.
+[Task 17/25]  Current/Best:   13.26/  18.47 GFLOPS | Progress: (4/20) | 4.86 s
+[Task 17/25]  Current/Best:   14.39/  23.25 GFLOPS | Progress: (8/20) | 7.77 s
+[Task 17/25]  Current/Best:   17.42/  23.25 GFLOPS | Progress: (12/20) | 9.83 s
+[Task 17/25]  Current/Best:   16.54/  23.25 GFLOPS | Progress: (16/20) | 12.05 s
+[Task 17/25]  Current/Best:   10.02/  23.25 GFLOPS | Progress: (20/20) | 14.25 s Done.
 
 [Task 18/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 18/25]  Current/Best:   11.28/  17.11 GFLOPS | Progress: (4/20) | 3.77 s
-[Task 18/25]  Current/Best:   10.58/  17.11 GFLOPS | Progress: (8/20) | 7.43 s
-[Task 18/25]  Current/Best:   19.41/  19.41 GFLOPS | Progress: (12/20) | 9.38 s
-[Task 18/25]  Current/Best:   10.12/  19.41 GFLOPS | Progress: (16/20) | 13.21 s
-[Task 18/25]  Current/Best:   20.82/  20.82 GFLOPS | Progress: (20/20) | 14.73 s Done.
+[Task 18/25]  Current/Best:   11.34/  17.82 GFLOPS | Progress: (4/20) | 3.86 s
+[Task 18/25]  Current/Best:   10.62/  20.07 GFLOPS | Progress: (8/20) | 7.55 s
+[Task 18/25]  Current/Best:   19.16/  20.07 GFLOPS | Progress: (12/20) | 9.50 s
+[Task 18/25]  Current/Best:    9.93/  20.07 GFLOPS | Progress: (16/20) | 13.32 s
+[Task 18/25]  Current/Best:   20.73/  20.73 GFLOPS | Progress: (20/20) | 14.84 s Done.
 
 [Task 19/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 19/25]  Current/Best:    7.27/  20.53 GFLOPS | Progress: (4/20) | 6.01 s
-[Task 19/25]  Current/Best:    2.61/  20.53 GFLOPS | Progress: (8/20) | 9.41 s
-[Task 19/25]  Current/Best:   20.27/  22.00 GFLOPS | Progress: (12/20) | 12.36 s
-[Task 19/25]  Current/Best:   13.91/  22.00 GFLOPS | Progress: (16/20) | 15.35 s
-[Task 19/25]  Current/Best:    2.70/  23.88 GFLOPS | Progress: (20/20) | 18.20 s Done.
+[Task 19/25]  Current/Best:    6.87/  20.21 GFLOPS | Progress: (4/20) | 6.30 s
+[Task 19/25]  Current/Best:    2.60/  20.21 GFLOPS | Progress: (8/20) | 9.63 s
+[Task 19/25]  Current/Best:   19.47/  20.97 GFLOPS | Progress: (12/20) | 12.56 s
+[Task 19/25]  Current/Best:   15.30/  21.11 GFLOPS | Progress: (16/20) | 15.55 s
+[Task 19/25]  Current/Best:    2.70/  23.35 GFLOPS | Progress: (20/20) | 18.32 s Done.
 
 [Task 20/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 20/25]  Current/Best:    9.35/  15.39 GFLOPS | Progress: (4/20) | 3.29 s Done.
+[Task 20/25]  Current/Best:    9.98/  15.08 GFLOPS | Progress: (4/20) | 3.40 s Done.
  Done.
 
-[Task 20/25]  Current/Best:    9.65/  15.39 GFLOPS | Progress: (8/20) | 6.81 s
-[Task 20/25]  Current/Best:    2.32/  16.66 GFLOPS | Progress: (12/20) | 10.71 s
-[Task 20/25]  Current/Best:   12.40/  16.66 GFLOPS | Progress: (16/20) | 14.36 s
-[Task 20/25]  Current/Best:   11.73/  22.23 GFLOPS | Progress: (20/20) | 16.43 s
+[Task 20/25]  Current/Best:   10.35/  15.08 GFLOPS | Progress: (8/20) | 6.82 s
+[Task 20/25]  Current/Best:    2.32/  16.56 GFLOPS | Progress: (12/20) | 10.73 s
+[Task 20/25]  Current/Best:   12.45/  16.56 GFLOPS | Progress: (16/20) | 14.63 s
+[Task 20/25]  Current/Best:   13.26/  21.96 GFLOPS | Progress: (20/20) | 16.74 s
 [Task 21/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 21/25]  Current/Best:    6.42/  17.75 GFLOPS | Progress: (4/20) | 3.25 s
-[Task 21/25]  Current/Best:   13.88/  17.75 GFLOPS | Progress: (8/20) | 4.84 s
-[Task 21/25]  Current/Best:    1.61/  17.75 GFLOPS | Progress: (12/20) | 6.98 s
-[Task 21/25]  Current/Best:   18.13/  18.13 GFLOPS | Progress: (16/20) | 10.49 s
-[Task 21/25]  Current/Best:    4.48/  18.13 GFLOPS | Progress: (20/20) | 17.69 s
+[Task 21/25]  Current/Best:    6.39/  17.57 GFLOPS | Progress: (4/20) | 3.34 s
+[Task 21/25]  Current/Best:   14.51/  17.57 GFLOPS | Progress: (8/20) | 4.95 s
+[Task 21/25]  Current/Best:    1.61/  17.57 GFLOPS | Progress: (12/20) | 7.12 s
+[Task 21/25]  Current/Best:   18.01/  18.01 GFLOPS | Progress: (16/20) | 10.73 s
+[Task 21/25]  Current/Best:    4.46/  18.01 GFLOPS | Progress: (20/20) | 18.13 s
 [Task 22/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 22/25]  Current/Best:    2.71/  17.04 GFLOPS | Progress: (4/20) | 2.66 s
-[Task 22/25]  Current/Best:    8.64/  21.92 GFLOPS | Progress: (8/20) | 4.61 s
-[Task 22/25]  Current/Best:   20.14/  21.92 GFLOPS | Progress: (12/20) | 7.01 s
-[Task 22/25]  Current/Best:   14.97/  21.92 GFLOPS | Progress: (16/20) | 9.11 s
-[Task 22/25]  Current/Best:   14.38/  21.92 GFLOPS | Progress: (20/20) | 10.83 s Done.
+[Task 22/25]  Current/Best:    2.70/  16.88 GFLOPS | Progress: (4/20) | 2.75 s
+[Task 22/25]  Current/Best:    9.08/  21.54 GFLOPS | Progress: (8/20) | 4.80 s
+[Task 22/25]  Current/Best:   19.92/  21.54 GFLOPS | Progress: (12/20) | 7.19 s
+[Task 22/25]  Current/Best:   14.94/  21.54 GFLOPS | Progress: (16/20) | 9.31 s
+[Task 22/25]  Current/Best:   14.94/  21.54 GFLOPS | Progress: (20/20) | 11.05 s Done.
 
 [Task 23/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 23/25]  Current/Best:   17.79/  20.89 GFLOPS | Progress: (4/20) | 3.22 s
-[Task 23/25]  Current/Best:   14.46/  20.89 GFLOPS | Progress: (8/20) | 6.67 s
-[Task 23/25]  Current/Best:   21.03/  21.84 GFLOPS | Progress: (12/20) | 8.46 s
-[Task 23/25]  Current/Best:    6.52/  21.84 GFLOPS | Progress: (16/20) | 15.33 s
-[Task 23/25]  Current/Best:    7.97/  21.84 GFLOPS | Progress: (20/20) | 19.52 s Done.
+[Task 23/25]  Current/Best:   17.38/  20.26 GFLOPS | Progress: (4/20) | 3.31 s
+[Task 23/25]  Current/Best:   15.65/  20.26 GFLOPS | Progress: (8/20) | 6.70 s
+[Task 23/25]  Current/Best:   20.78/  21.35 GFLOPS | Progress: (12/20) | 8.56 s
+[Task 23/25]  Current/Best:    6.18/  21.35 GFLOPS | Progress: (16/20) | 15.66 s
+[Task 23/25]  Current/Best:    7.55/  21.35 GFLOPS | Progress: (20/20) | 19.97 s Done.
 
 [Task 24/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 24/25]  Current/Best:    8.52/   8.52 GFLOPS | Progress: (4/20) | 11.76 s
-[Task 24/25]  Current/Best:    3.75/   8.52 GFLOPS | Progress: (8/20) | 22.98 s
-[Task 24/25]  Current/Best:    4.44/   8.52 GFLOPS | Progress: (12/20) | 33.69 s Done.
+[Task 24/25]  Current/Best:    8.54/   8.54 GFLOPS | Progress: (4/20) | 11.89 s
+[Task 24/25]  Current/Best:    3.44/   8.54 GFLOPS | Progress: (8/20) | 23.17 s
+[Task 24/25]  Current/Best:    4.33/   8.54 GFLOPS | Progress: (12/20) | 33.90 s Done.
 
-[Task 24/25]  Current/Best:    6.12/   9.01 GFLOPS | Progress: (16/20) | 39.21 s
-[Task 24/25]  Current/Best:    3.39/   9.01 GFLOPS | Progress: (20/20) | 45.13 s Done.
+[Task 24/25]  Current/Best:    7.36/   8.70 GFLOPS | Progress: (16/20) | 39.71 s
+[Task 24/25]  Current/Best:    3.27/   9.03 GFLOPS | Progress: (20/20) | 45.71 s Done.
 
 [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/20) | 0.00 s
-[Task 25/25]  Current/Best:    1.55/   2.75 GFLOPS | Progress: (4/20) | 11.58 s
-[Task 25/25]  Current/Best:    6.01/   8.38 GFLOPS | Progress: (8/20) | 22.82 s
-[Task 25/25]  Current/Best:    6.08/   8.38 GFLOPS | Progress: (12/20) | 34.09 s
-[Task 25/25]  Current/Best:    5.86/   8.99 GFLOPS | Progress: (16/20) | 35.96 s
-[Task 25/25]  Current/Best:    2.89/   9.40 GFLOPS | Progress: (20/20) | 46.62 s
+[Task 25/25]  Current/Best:    1.55/   2.93 GFLOPS | Progress: (4/20) | 11.65 s
+[Task 25/25]  Current/Best:    5.76/   8.11 GFLOPS | Progress: (8/20) | 22.97 s
+[Task 25/25]  Current/Best:    5.93/   8.11 GFLOPS | Progress: (12/20) | 34.40 s
+[Task 25/25]  Current/Best:    5.86/   9.34 GFLOPS | Progress: (16/20) | 36.28 s
+[Task 25/25]  Current/Best:    2.94/   9.34 GFLOPS | Progress: (20/20) | 46.99 s
 </pre></div>
 </div>
 <p>The output from this tuning process will look something like this:</p>
@@ -980,8 +980,8 @@ improvement in comparing the optimized model to the unoptimized model.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;unoptimized: </span><span class="si">%s</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="p">(</span><a href="https://docs.python.org/3/library/stdtypes.html#dict" title="builtins.dict" class="sphx-glr-backref-module-builtins sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">unoptimized</span></a><span class="p">))</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>optimized: {&#39;mean&#39;: 415.2614426499713, &#39;median&#39;: 415.2566297500016, &#39;std&#39;: 0.5934935002330597}
-unoptimized: {&#39;mean&#39;: 490.9751048400267, &#39;median&#39;: 490.97516415004065, &#39;std&#39;: 0.751160590199003}
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>optimized: {&#39;mean&#39;: 417.61745977002647, &#39;median&#39;: 417.75443490005273, &#39;std&#39;: 0.7636631550773011}
+unoptimized: {&#39;mean&#39;: 499.76453579998633, &#39;median&#39;: 499.21146484994097, &#39;std&#39;: 1.7238883331097248}
 </pre></div>
 </div>
 </div>
@@ -995,7 +995,7 @@ models.</p>
 <p>Here we presented a simple example using ResNet-50 v2 locally. However, TVM
 supports many more features including cross-compilation, remote execution and
 profiling/benchmarking.</p>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 10 minutes  19.119 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 10 minutes  32.758 seconds)</p>
 <div class="sphx-glr-footer sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorial-autotvm-relay-x86-py">
 <div class="sphx-glr-download sphx-glr-download-python docutils container">
 <p><a class="reference download internal" download="" href="../_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">autotvm_relay_x86.py</span></code></a></p>
diff --git a/docs/tutorial/cross_compilation_and_rpc.html b/docs/tutorial/cross_compilation_and_rpc.html
index 61b0cba25..a8c4e7813 100644
--- a/docs/tutorial/cross_compilation_and_rpc.html
+++ b/docs/tutorial/cross_compilation_and_rpc.html
@@ -526,7 +526,7 @@ device and returns the measured cost. Network overhead is excluded.</p>
 <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">%g</span><span class="s2"> secs/op&quot;</span> <span class="o">%</span> <span class="n">cost</span><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>1.259e-07 secs/op
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>1.256e-07 secs/op
 </pre></div>
 </div>
 </div>
diff --git a/docs/tutorial/intro_topi.html b/docs/tutorial/intro_topi.html
index 372672b1a..c1667bf11 100644
--- a/docs/tutorial/intro_topi.html
+++ b/docs/tutorial/intro_topi.html
@@ -483,7 +483,7 @@ we can schedule the following series of operations ending with <code class="code
 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="nb">print</span><span class="p">(</span><a href="../reference/api/python/ir.html#tvm.ir.Array" title="tvm.ir.Array" class="sphx-glr-backref-module-tvm-ir sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">sg</span><span class="o">.</span><span class="n">stages</span></a><span class="p">)</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[stage(a, placeholder(a, 0x201e1120)), stage(b, placeholder(b, 0x201eec20)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[ [...]
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>[stage(a, placeholder(a, 0xbdc5a40)), stage(b, placeholder(b, 0x19afbd60)), stage(T_add, compute(T_add, body=[(a[ax0, ax1, ax2] + b[ax1, ax2])], axis=[iter_var(ax0, range(min=0, ext=100)), iter_var(ax1, range(min=0, ext=10)), iter_var(ax2, range(min=0, ext=10))], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[(a[ax0, ax1, ax2]*b[ax1, ax2])], axis=[i [...]
 </pre></div>
 </div>
 <p>We can test the correctness by comparing with <code class="code docutils literal notranslate"><span class="pre">numpy</span></code> result as follows</p>
diff --git a/docs/tutorial/sg_execution_times.html b/docs/tutorial/sg_execution_times.html
index fcf922074..baab63452 100644
--- a/docs/tutorial/sg_execution_times.html
+++ b/docs/tutorial/sg_execution_times.html
@@ -327,7 +327,7 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorial-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>13:34.121</strong> total execution time for <strong>tutorial</strong> files:</p>
+<p><strong>13:30.966</strong> total execution time for <strong>tutorial</strong> files:</p>
 <table class="docutils align-default">
 <colgroup>
 <col style="width: 83%" />
@@ -336,50 +336,50 @@
 </colgroup>
 <tbody>
 <tr class="row-odd"><td><p><a class="reference internal" href="autotvm_relay_x86.html#sphx-glr-tutorial-autotvm-relay-x86-py"><span class="std std-ref">Compiling and Optimizing a Model with the Python Interface (AutoTVM)</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_relay_x86.py</span></code>)</p></td>
-<td><p>10:19.119</p></td>
+<td><p>10:32.758</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="auto_scheduler_matmul_x86.html#sphx-glr-tutorial-auto-scheduler-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Auto-scheduling</span></a> (<code class="docutils literal notranslate"><span class="pre">auto_scheduler_matmul_x86.py</span></code>)</p></td>
-<td><p>01:19.752</p></td>
+<td><p>01:02.489</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tensor_expr_get_started.html#sphx-glr-tutorial-tensor-expr-get-started-py"><span class="std std-ref">Working with Operators Using Tensor Expression</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_expr_get_started.py</span></code>)</p></td>
-<td><p>00:59.336</p></td>
+<td><p>00:59.931</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="relay_quick_start.html#sphx-glr-tutorial-relay-quick-start-py"><span class="std std-ref">Quick Start Tutorial for Compiling Deep Learning Models</span></a> (<code class="docutils literal notranslate"><span class="pre">relay_quick_start.py</span></code>)</p></td>
-<td><p>00:29.731</p></td>
+<td><p>00:30.726</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="autotvm_matmul_x86.html#sphx-glr-tutorial-autotvm-matmul-x86-py"><span class="std std-ref">Optimizing Operators with Schedule Templates and AutoTVM</span></a> (<code class="docutils literal notranslate"><span class="pre">autotvm_matmul_x86.py</span></code>)</p></td>
-<td><p>00:24.637</p></td>
+<td><p>00:23.659</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="intro_topi.html#sphx-glr-tutorial-intro-topi-py"><span class="std std-ref">Introduction to TOPI</span></a> (<code class="docutils literal notranslate"><span class="pre">intro_topi.py</span></code>)</p></td>
-<td><p>00:00.693</p></td>
+<td><p>00:00.713</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="tensor_ir_blitz_course.html#sphx-glr-tutorial-tensor-ir-blitz-course-py"><span class="std std-ref">Blitz Course to TensorIR</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_ir_blitz_course.py</span></code>)</p></td>
-<td><p>00:00.689</p></td>
+<td><p>00:00.519</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-even"><td><p><a class="reference internal" href="cross_compilation_and_rpc.html#sphx-glr-tutorial-cross-compilation-and-rpc-py"><span class="std std-ref">Cross Compilation and RPC</span></a> (<code class="docutils literal notranslate"><span class="pre">cross_compilation_and_rpc.py</span></code>)</p></td>
-<td><p>00:00.157</p></td>
+<td><p>00:00.164</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="reference internal" href="introduction.html#sphx-glr-tutorial-introduction-py"><span class="std std-ref">Introduction</span></a> (<code class="docutils literal notranslate"><span class="pre">introduction.py</span></code>)</p></td>
 <td><p>00:00.005</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="tvmc_python.html#sphx-glr-tutorial-tvmc-python-py"><span class="std std-ref">Getting Starting using TVMC Python: a high-level API for TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_python.py</span></code>)</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="tvmc_command_line_driver.html#sphx-glr-tutorial-tvmc-command-line-driver-py"><span class="std std-ref">Compiling and Optimizing a Model with TVMC</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_command_line_driver.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-odd"><td><p><a class="reference internal" href="install.html#sphx-glr-tutorial-install-py"><span class="std std-ref">Installing TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">install.py</span></code>)</p></td>
+<tr class="row-odd"><td><p><a class="reference internal" href="tvmc_python.html#sphx-glr-tutorial-tvmc-python-py"><span class="std std-ref">Getting Starting using TVMC Python: a high-level API for TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_python.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
-<tr class="row-even"><td><p><a class="reference internal" href="tvmc_command_line_driver.html#sphx-glr-tutorial-tvmc-command-line-driver-py"><span class="std std-ref">Compiling and Optimizing a Model with TVMC</span></a> (<code class="docutils literal notranslate"><span class="pre">tvmc_command_line_driver.py</span></code>)</p></td>
+<tr class="row-even"><td><p><a class="reference internal" href="install.html#sphx-glr-tutorial-install-py"><span class="std std-ref">Installing TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">install.py</span></code>)</p></td>
 <td><p>00:00.001</p></td>
 <td><p>0.0 MB</p></td>
 </tr>
diff --git a/docs/tutorial/tensor_expr_get_started.html b/docs/tutorial/tensor_expr_get_started.html
index 96e66bb95..a08b79130 100644
--- a/docs/tutorial/tensor_expr_get_started.html
+++ b/docs/tutorial/tensor_expr_get_started.html
@@ -541,8 +541,8 @@ helper function to run a profile of the TVM generated code.</p>
 <span class="n">evaluate_addition</span><span class="p">(</span><span class="n">fadd</span><span class="p">,</span> <a href="../reference/api/python/target.html#tvm.target.Target" title="tvm.target.Target" class="sphx-glr-backref-module-tvm-target sphx-glr-backref-type-py-class sphx-glr-backref-instance"><span class="n">tgt</span></a><span class="p">,</span> <span class="s2">&quot;naive&quot;</span><span class="p">,</span> <a href="https://docs.python.org/3/library/stdtypes.html#list" ti [...]
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.000008
-naive: 0.000006
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.000007
+naive: 0.000007
 </pre></div>
 </div>
 </div>
@@ -634,7 +634,7 @@ factor to be the number of threads on your CPU.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-vector: 0.000025
+vector: 0.000026
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [(stride: int32*n: int32)], [], type=&quot;auto&quot;),
@@ -667,10 +667,10 @@ vector: 0.000025
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Operator                  Timing             Performance
-   numpy    8.126039992930601e-06                    1.0
-   naive              5.9734e-06      0.7350936009663588
-parallel              6.9913e-06      0.8603575672876593
-  vector              2.4531e-05      3.0188135944865144
+   numpy    7.137789998523658e-06                    1.0
+   naive              6.6301e-06      0.9288729426575081
+parallel    6.8720000000000004e-06    0.9627629842600256
+  vector    2.5590000000000004e-05    3.5851433013990186
 </pre></div>
 </div>
 <div class="admonition-code-specialization admonition">
@@ -986,7 +986,7 @@ matrix multiplication.</p>
 <span class="n">answer</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">b</span><span class="o">.</span><span class="n">numpy</span><span class="p">())</span>
 </pre></div>
 </div>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.017564
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>Numpy running time: 0.018889
 </pre></div>
 </div>
 <p>Now we write a basic matrix multiplication using TVM TE and verify that it
@@ -1029,7 +1029,7 @@ optimizations.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-none: 3.298072
+none: 3.289132
 </pre></div>
 </div>
 <p>Let’s take a look at the intermediate representation of the operator and
@@ -1096,7 +1096,7 @@ schedule.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-blocking: 0.307013
+blocking: 0.329313
 </pre></div>
 </div>
 <p>By reordering the computation to take advantage of caching, you should see a
@@ -1157,7 +1157,7 @@ already cache friendly from our previous optimizations.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-vectorization: 0.340419
+vectorization: 0.358022
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1214,7 +1214,7 @@ more cache friendly.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-loop permutation: 0.111652
+loop permutation: 0.116325
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1292,7 +1292,7 @@ optimized schedule.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-array packing: 0.108105
+array packing: 0.109678
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1368,7 +1368,7 @@ to `C</cite> when all the block results are ready.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-block caching: 0.110703
+block caching: 0.110596
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1437,7 +1437,7 @@ of thread-level parallelization.</p>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>/workspace/python/tvm/driver/build_module.py:268: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
   &quot;target_host parameter is going to be deprecated. &quot;
-parallelization: 0.144381
+parallelization: 0.144771
 @main = primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;from_legacy_te_schedule&quot;: True, &quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
@@ -1499,13 +1499,13 @@ working, we can compare the results.</p>
 </pre></div>
 </div>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre><span></span>        Operator                  Timing             Performance
-            none            3.2980724726                     1.0
-        blocking            0.3070129586     0.09308860285837492
-   vectorization            0.3404186675      0.1032174612074654
-loop permutation     0.11165196200000001    0.033853701799336264
-   array packing     0.10810546380000001     0.03277837727888867
-   block caching     0.11070322930000001     0.03356603901815666
- parallelization     0.14438079519999997     0.04377732642308461
+            none            3.2891317263                     1.0
+        blocking            0.3293131147     0.10012159502971622
+   vectorization            0.3580215504     0.10884986683179895
+loop permutation            0.1163250125    0.035366480329705734
+   array packing     0.10967751880000001    0.033345432146427934
+   block caching     0.11059642839999999     0.03362480970757951
+ parallelization            0.1447713667    0.044015071072527626
 </pre></div>
 </div>
 <p>Note that the outputs on the web page reflect the running times on a