Posted to commits@tvm.apache.org by lm...@apache.org on 2020/09/22 08:01:14 UTC

[incubator-tvm-site] branch asf-site updated: Docs build at Tue Sep 22 01:00:59 PDT 2020

This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new a09e774  Docs build at Tue Sep 22 01:00:59 PDT 2020
a09e774 is described below

commit a09e774a88004011531c9127e37ac41ebc3cadf0
Author: Lianmin Zheng <li...@gmail.com>
AuthorDate: Tue Sep 22 01:01:00 2020 -0700

    Docs build at Tue Sep 22 01:00:59 PDT 2020
---
 .../tune_conv2d_layer_cuda.py                      |   12 +-
 .../tune_matmul_x86.py                             |   15 +-
 .../tune_conv2d_layer_cuda.ipynb                   |    2 +-
 .../tune_matmul_x86.ipynb                          |    8 +-
 docs/_sources/api/python/auto_scheduler.rst.txt    |   32 +-
 .../auto_scheduler/sg_execution_times.rst.txt      |    6 +-
 .../auto_scheduler/tune_conv2d_layer_cuda.rst.txt  | 1431 +++++--------------
 .../auto_scheduler/tune_matmul_x86.rst.txt         |  361 ++++-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |   16 +-
 .../tutorials/autotvm/tune_conv2d_cuda.rst.txt     |   42 +-
 .../tutorials/autotvm/tune_simple_template.rst.txt |   20 +-
 .../tutorials/dev/sg_execution_times.rst.txt       |    6 +-
 .../frontend/deploy_model_on_android.rst.txt       |    2 +-
 .../deploy_object_detection_pytorch.rst.txt        |    2 +-
 .../tutorials/frontend/deploy_prequantized.rst.txt |    2 +-
 .../frontend/deploy_prequantized_tflite.rst.txt    |    4 +-
 .../tutorials/frontend/deploy_ssd_gluoncv.rst.txt  |    2 +-
 docs/_sources/tutorials/frontend/from_onnx.rst.txt |    2 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |   40 +-
 .../get_started/cross_compilation_and_rpc.rst.txt  |    2 +-
 .../get_started/relay_quick_start.rst.txt          |    2 +-
 .../get_started/sg_execution_times.rst.txt         |    6 +-
 .../tutorials/language/schedule_primitives.rst.txt |   20 +-
 .../tutorials/language/sg_execution_times.rst.txt  |   12 +-
 docs/_sources/tutorials/language/tensorize.rst.txt |    8 +-
 .../tutorials/language/tuple_inputs.rst.txt        |   24 +-
 .../tutorials/micro/sg_execution_times.rst.txt     |    4 +-
 .../tutorials/optimize/opt_conv_cuda.rst.txt       |    2 +-
 .../tutorials/optimize/opt_conv_tensorcore.rst.txt |   10 +-
 docs/_sources/tutorials/optimize/opt_gemm.rst.txt  |   32 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |   10 +-
 docs/_sources/tutorials/topi/intro_topi.rst.txt    |    2 +-
 .../tutorials/topi/sg_execution_times.rst.txt      |    4 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |    4 +-
 .../vta/tutorials/autotvm/tune_relay_vta.rst.txt   |    2 +-
 .../frontend/deploy_classification.rst.txt         |    4 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |    4 +-
 .../vta/tutorials/optimize/convolution_opt.rst.txt |    4 +-
 .../tutorials/optimize/matrix_multiply_opt.rst.txt |    8 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |    6 +-
 .../vta/tutorials/sg_execution_times.rst.txt       |    4 +-
 .../_sources/vta/tutorials/vta_get_started.rst.txt |    4 +-
 docs/api/python/auto_scheduler.html                |  891 ++++++++++--
 docs/api/python/autotvm.html                       |   30 +-
 docs/api/python/index.html                         |    7 +-
 docs/api/typedoc/classes/bytestreamreader.html     |   12 +-
 docs/api/typedoc/classes/cachedcallstack.html      |   34 +-
 docs/api/typedoc/classes/dlcontext.html            |   10 +-
 docs/api/typedoc/classes/dldatatype.html           |   12 +-
 docs/api/typedoc/classes/environment.html          |   12 +-
 docs/api/typedoc/classes/ffilibrary.html           |   20 +-
 docs/api/typedoc/classes/graphruntime.html         |   16 +-
 docs/api/typedoc/classes/instance.html             |   40 +-
 docs/api/typedoc/classes/memory.html               |   34 +-
 docs/api/typedoc/classes/module.html               |   10 +-
 docs/api/typedoc/classes/ndarray.html              |   22 +-
 docs/api/typedoc/classes/packedfunccell.html       |    6 +-
 docs/api/typedoc/classes/rpcserver.html            |   14 +-
 docs/api/typedoc/classes/scalar.html               |    6 +-
 docs/api/typedoc/classes/webgpucontext.html        |   12 +-
 docs/api/typedoc/enums/argtypecode.html            |   30 +-
 docs/api/typedoc/enums/aynccallbackcode.html       |    4 +-
 docs/api/typedoc/enums/dldatatypecode.html         |    8 +-
 docs/api/typedoc/enums/rpcserverstate.html         |   12 +-
 docs/api/typedoc/enums/sizeof.html                 |   18 +-
 docs/api/typedoc/index.html                        |  114 +-
 docs/api/typedoc/interfaces/disposable.html        |    2 +-
 docs/api/typedoc/interfaces/functioninfo.html      |    6 +-
 docs/api/typedoc/interfaces/libraryprovider.html   |    4 +-
 docs/genindex.html                                 |  142 +-
 docs/objects.inv                                   |  Bin 16684 -> 16936 bytes
 docs/py-modindex.html                              |   10 -
 docs/searchindex.js                                |    2 +-
 .../auto_scheduler/sg_execution_times.html         |    6 +-
 .../auto_scheduler/tune_conv2d_layer_cuda.html     | 1455 +++++---------------
 docs/tutorials/auto_scheduler/tune_matmul_x86.html |  386 +++++-
 docs/tutorials/autotvm/sg_execution_times.html     |   14 +-
 docs/tutorials/autotvm/tune_conv2d_cuda.html       |   42 +-
 docs/tutorials/autotvm/tune_simple_template.html   |   20 +-
 docs/tutorials/dev/sg_execution_times.html         |    6 +-
 .../frontend/deploy_model_on_android.html          |    2 +-
 .../frontend/deploy_object_detection_pytorch.html  |    2 +-
 docs/tutorials/frontend/deploy_prequantized.html   |    2 +-
 .../frontend/deploy_prequantized_tflite.html       |    4 +-
 docs/tutorials/frontend/deploy_ssd_gluoncv.html    |    2 +-
 docs/tutorials/frontend/from_onnx.html             |    6 +-
 docs/tutorials/frontend/sg_execution_times.html    |   40 +-
 .../get_started/cross_compilation_and_rpc.html     |    2 +-
 docs/tutorials/get_started/relay_quick_start.html  |  102 +-
 docs/tutorials/get_started/sg_execution_times.html |    6 +-
 docs/tutorials/language/schedule_primitives.html   |   20 +-
 docs/tutorials/language/sg_execution_times.html    |   12 +-
 docs/tutorials/language/tensorize.html             |    8 +-
 docs/tutorials/language/tuple_inputs.html          |   24 +-
 docs/tutorials/micro/sg_execution_times.html       |    4 +-
 docs/tutorials/optimize/opt_conv_cuda.html         |    2 +-
 docs/tutorials/optimize/opt_conv_tensorcore.html   |   10 +-
 docs/tutorials/optimize/opt_gemm.html              |   32 +-
 docs/tutorials/optimize/sg_execution_times.html    |   10 +-
 docs/tutorials/topi/intro_topi.html                |    2 +-
 docs/tutorials/topi/sg_execution_times.html        |    4 +-
 docs/vta/tutorials/autotvm/sg_execution_times.html |    4 +-
 docs/vta/tutorials/autotvm/tune_relay_vta.html     |  184 +--
 .../tutorials/frontend/deploy_classification.html  |   18 +-
 .../vta/tutorials/frontend/sg_execution_times.html |    4 +-
 docs/vta/tutorials/optimize/convolution_opt.html   |    4 +-
 .../tutorials/optimize/matrix_multiply_opt.html    |    8 +-
 .../vta/tutorials/optimize/sg_execution_times.html |    6 +-
 docs/vta/tutorials/sg_execution_times.html         |    4 +-
 docs/vta/tutorials/vta_get_started.html            |    4 +-
 110 files changed, 2980 insertions(+), 3204 deletions(-)

diff --git a/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py b/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py
index 98e66bb..74b3775 100644
--- a/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py
+++ b/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py
@@ -74,20 +74,20 @@ print(task.compute_dag)
 # Next, we set parameters for the auto-scheduler. These parameters
 # mainly specify how we do the measurement during the search and auto-tuning.
 #
-# * `measure_ctx` launches a different process for measurement. This
+# * :code:`measure_ctx` launches a different process for measurement. This
 #   provides an isolation. It can protect the master process from GPU crashes
 #   happended during measurement and avoid other runtime conflicts.
-# * `min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
+# * :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
 #   This can warmup the GPU, which is necessary to get accurate measurement results.
 #   Typically, we recommend a value > 300 ms.
-# * `num_measure_trials` is the number of measurement trials we can use during the search.
+# * :code:`num_measure_trials` is the number of measurement trials we can use during the search.
 #   We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
 #   good value for the search to converge. You can do more trials according to your time budget.
-# * In addition, we use `RecordToFile` to dump measurement records into a file `conv2d.json`.
+# * In addition, we use :code:`RecordToFile` to dump measurement records into a file `conv2d.json`.
 #   The measurement records can be used to query the history best, resume the search,
 #   and do more analyses later.
-# * see :any:`auto_scheduler.auto_schedule.TuningOptions`:,
-#   :any:`auto_scheduler.measure.LocalRPCMeasureContext` for more parameters.
+# * see :any:`auto_scheduler.TuningOptions`,
+#   :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
 
 measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
 tune_option = auto_scheduler.TuningOptions(
diff --git a/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py b/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py
index 918030d..e5f9d7e 100644
--- a/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py
+++ b/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py
@@ -60,11 +60,12 @@ def matmul_add(N, L, M, dtype):
 # ^^^^^^^^^^^^^^^^^^^^^^
 # We then create a search task with N=L=M=128 and dtype="float32"
 # If your machine supports avx instructions, you can
-# - replace "llvm" below with "llvm -mcpu=core-avx2" to enable AVX2
-# - replace "llvm" below with "llvm -mcpu=skylake-avx512" to enable AVX-512
+#
+#   - replace "llvm" below with "llvm -mcpu=core-avx2" to enable AVX2
+#   - replace "llvm" below with "llvm -mcpu=skylake-avx512" to enable AVX-512
 
 target = tvm.target.Target("llvm")
-task = auto_scheduler.create_task(matmul_add, (128, 128, 128, "float32"), target)
+task = tvm.auto_scheduler.create_task(matmul_add, (128, 128, 128, "float32"), target)
 
 # Inspect the computational graph
 print(task.compute_dag)
@@ -72,13 +73,13 @@ print(task.compute_dag)
 ######################################################################
 # Next, we set parameters for the auto-scheduler.
 #
-# * `num_measure_trials` is the number of measurement trials we can use during the search.
+# * :code:`num_measure_trials` is the number of measurement trials we can use during the search.
 #   We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
 #   good value for the search to converge. You can do more trials according to your time budget.
-# * In addition, we use `RecordToFile` to dump measurement records into a file `matmul.json`.
+# * In addition, we use :code:`RecordToFile` to dump measurement records into a file `matmul.json`.
 #   The measurement records can be used to query the history best, resume the search,
 #   and do more analyses later.
-# * see :any:`auto_scheduler.auto_schedule.TuningOptions`: for more parameters
+# * see :any:`auto_scheduler.TuningOptions` for more parameters
 
 tune_option = auto_scheduler.TuningOptions(
     num_measure_trials=10, measure_callbacks=[auto_scheduler.RecordToFile("matmul.json")]
@@ -189,5 +190,5 @@ def resume_search(task, log_file):
 #   For example, you can start a new thread/process (with the builtin python library
 #   threading or multiprocessing) and run the tvm binaries in the new thread/process.
 #   This provides an isolation and avoids the conflict in the main thread/process.
-#   You can also use :any:`auto_scheduler.measure.LocalRPCMeasureContext` for auto-scheduler,
+#   You can also use :any:`auto_scheduler.LocalRPCMeasureContext` for auto-scheduler,
 #   as shown in the GPU tutorial (:ref:`auto-scheduler-conv-gpu`).
diff --git a/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb b/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb
index efd5141..c19a19c 100644
--- a/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb
+++ b/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb
@@ -69,7 +69,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "Next, we set parameters for the auto-scheduler. These parameters\nmainly specify how we do the measurement during the search and auto-tuning.\n\n* `measure_ctx` launches a different process for measurement. This\n  provides an isolation. It can protect the master process from GPU crashes\n  happended during measurement and avoid other runtime conflicts.\n* `min_repeat_ms` defines the minimum duration of one \"repeat\" in every measurement.\n  This can warmup the GPU, which is ne [...]
+        "Next, we set parameters for the auto-scheduler. These parameters\nmainly specify how we do the measurement during the search and auto-tuning.\n\n* :code:`measure_ctx` launches a different process for measurement. This\n  provides an isolation. It can protect the master process from GPU crashes\n  happended during measurement and avoid other runtime conflicts.\n* :code:`min_repeat_ms` defines the minimum duration of one \"repeat\" in every measurement.\n  This can warmup the GPU, [...]
       ]
     },
     {
diff --git a/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb b/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb
index 113b92f..e8dbcd6 100644
--- a/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb
+++ b/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb
@@ -51,7 +51,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "Create the search task\n^^^^^^^^^^^^^^^^^^^^^^\nWe then create a search task with N=L=M=128 and dtype=\"float32\"\nIf your machine supports avx instructions, you can\n- replace \"llvm\" below with \"llvm -mcpu=core-avx2\" to enable AVX2\n- replace \"llvm\" below with \"llvm -mcpu=skylake-avx512\" to enable AVX-512\n\n"
+        "Create the search task\n^^^^^^^^^^^^^^^^^^^^^^\nWe then create a search task with N=L=M=128 and dtype=\"float32\"\nIf your machine supports avx instructions, you can\n\n  - replace \"llvm\" below with \"llvm -mcpu=core-avx2\" to enable AVX2\n  - replace \"llvm\" below with \"llvm -mcpu=skylake-avx512\" to enable AVX-512\n\n"
       ]
     },
     {
@@ -62,14 +62,14 @@
       },
       "outputs": [],
       "source": [
-        "target = tvm.target.Target(\"llvm\")\ntask = auto_scheduler.create_task(matmul_add, (128, 128, 128, \"float32\"), target)\n\n# Inspect the computational graph\nprint(task.compute_dag)"
+        "target = tvm.target.Target(\"llvm\")\ntask = tvm.auto_scheduler.create_task(matmul_add, (128, 128, 128, \"float32\"), target)\n\n# Inspect the computational graph\nprint(task.compute_dag)"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "Next, we set parameters for the auto-scheduler.\n\n* `num_measure_trials` is the number of measurement trials we can use during the search.\n  We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a\n  good value for the search to converge. You can do more trials according to your time budget.\n* In addition, we use `RecordToFile` to dump measurement records into a file `matmul.json`.\n  The measurement records can be used to query the history be [...]
+        "Next, we set parameters for the auto-scheduler.\n\n* :code:`num_measure_trials` is the number of measurement trials we can use during the search.\n  We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a\n  good value for the search to converge. You can do more trials according to your time budget.\n* In addition, we use :code:`RecordToFile` to dump measurement records into a file `matmul.json`.\n  The measurement records can be used to query th [...]
       ]
     },
     {
@@ -184,7 +184,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "<div class=\"alert alert-info\"><h4>Note</h4><p>We cannot run the line above because of the conflict between\n  python's multiprocessing and tvm's thread pool.\n  After running a tvm generated binary the python's multiprocessing library\n  will hang forever. You have to make sure that you don't run any tvm\n  generated binaries before calling auot-scheduler's search.\n  To run the function above, you should comment out all code in\n  \"Check correctness and evaluate performance\ [...]
+        "<div class=\"alert alert-info\"><h4>Note</h4><p>We cannot run the line above because of the conflict between\n  python's multiprocessing and tvm's thread pool.\n  After running a tvm generated binary the python's multiprocessing library\n  will hang forever. You have to make sure that you don't run any tvm\n  generated binaries before calling auot-scheduler's search.\n  To run the function above, you should comment out all code in\n  \"Check correctness and evaluate performance\ [...]
       ]
     }
   ],
diff --git a/docs/_sources/api/python/auto_scheduler.rst.txt b/docs/_sources/api/python/auto_scheduler.rst.txt
index a7c190a..c5b8dcc 100644
--- a/docs/_sources/api/python/auto_scheduler.rst.txt
+++ b/docs/_sources/api/python/auto_scheduler.rst.txt
@@ -18,33 +18,7 @@
 tvm.auto_scheduler
 ------------------
 .. automodule:: tvm.auto_scheduler
+   :members:
+   :imported-members:
+   :autosummary:
 
-tvm.auto_scheduler.auto_schedule
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. automodule:: tvm.auto_scheduler.auto_schedule
-
-.. autoclass:: tvm.auto_scheduler.auto_schedule.SearchTask
-
-.. autoclass:: tvm.auto_scheduler.auto_schedule.TuningOptions
-
-.. autofunction:: tvm.auto_scheduler.auto_schedule.create_task
-
-.. autofunction:: tvm.auto_scheduler.auto_schedule.auto_schedule
-
-tvm.auto_scheduler.workload_registry
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autofunction:: tvm.auto_scheduler.workload_registry.register_workload
-
-
-tvm.auto_scheduler.measure
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. automodule:: tvm.auto_scheduler.measure
-
-.. autoclass:: tvm.auto_scheduler.measure.LocalRPCMeasureContext
-
-.. autoclass:: tvm.auto_scheduler.measure.LocalRunner
-
-.. autoclass:: tvm.auto_scheduler.measure.LocalBuilder
-
-.. autoclass:: tvm.auto_scheduler.measure.RPCRunner
diff --git a/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt b/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt
index 953ce1c..1cea4be 100644
--- a/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**04:47.459** total execution time for **tutorials_auto_scheduler** files:
+**04:38.356** total execution time for **tutorials_auto_scheduler** files:
 
-- **02:51.342**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
-- **01:56.117**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_matmul_x86.py` (``tune_matmul_x86.py``)
+- **02:51.937**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
+- **01:46.419**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_matmul_x86.py` (``tune_matmul_x86.py``)
diff --git a/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt
index 2b9b6e8..45d0b6e 100644
--- a/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt
@@ -106,20 +106,20 @@ We then create a search task for the last convolution layer in the resnet.
 Next, we set parameters for the auto-scheduler. These parameters
 mainly specify how we do the measurement during the search and auto-tuning.
 
-* `measure_ctx` launches a different process for measurement. This
+* :code:`measure_ctx` launches a different process for measurement. This
   provides an isolation. It can protect the master process from GPU crashes
   happended during measurement and avoid other runtime conflicts.
-* `min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
+* :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
   This can warmup the GPU, which is necessary to get accurate measurement results.
   Typically, we recommend a value > 300 ms.
-* `num_measure_trials` is the number of measurement trials we can use during the search.
+* :code:`num_measure_trials` is the number of measurement trials we can use during the search.
   We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
   good value for the search to converge. You can do more trials according to your time budget.
-* In addition, we use `RecordToFile` to dump measurement records into a file `conv2d.json`.
+* In addition, we use :code:`RecordToFile` to dump measurement records into a file `conv2d.json`.
   The measurement records can be used to query the history best, resume the search,
   and do more analyses later.
-* see :any:`auto_scheduler.auto_schedule.TuningOptions`:,
-  :any:`auto_scheduler.measure.LocalRPCMeasureContext` for more parameters.
+* see :any:`auto_scheduler.TuningOptions`,
+  :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
 
 
 .. code-block:: default
@@ -194,1110 +194,319 @@ cooperative fetching, unrolling and operator fusion.
 
     primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
-                 compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
+      buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
+                 bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
                  kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),
                  data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
       buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
-      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 64;
+      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 16;
       attr [compute_3: Pointer(float32)] "storage_scope" = "local";
-      allocate(compute_3, float32, [8]);
+      allocate(compute_3, float32, [14]);
       attr [pad_temp.shared: Pointer(float32)] "storage_scope" = "shared";
-      allocate(pad_temp.shared, float32, [1568]);
+      allocate(pad_temp.shared, float32, [162]);
       attr [kernel.shared: Pointer(float32)] "storage_scope" = "shared";
-      allocate(kernel.shared, float32, [256]);
-      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
+      allocate(kernel.shared, float32, [576]);
+      attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {
         compute_3[0] = 0f32
-        compute_3[2] = 0f32
-        compute_3[4] = 0f32
-        compute_3[6] = 0f32
         compute_3[1] = 0f32
+        compute_3[2] = 0f32
         compute_3[3] = 0f32
+        compute_3[4] = 0f32
         compute_3[5] = 0f32
+        compute_3[6] = 0f32
         compute_3[7] = 0f32
-        for (rc.outer.outer: int32, 0, 16) {
-          for (ry.outer.outer: int32, 0, 3) {
-            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[(threadIdx.x_1*2)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) - 8)], 0f32, dtype=float32)
-              pad_temp.shared[((threadIdx.x_1*2) + 1)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) - 8)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 98)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 90)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 98)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 90)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 196)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 188)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 196)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 188)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 294)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 286)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 294)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 286)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 392)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 384)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 392)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 384)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 490)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 482)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 490)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 482)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 588)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 580)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 588)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 580)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 686)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 678)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 686)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 678)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 784)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 776)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 784)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 776)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 882)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 874)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 882)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 874)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 980)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 972)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 980)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 972)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1078)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1070)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1078)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1070)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1176)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1168)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1176)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1168)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1274)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1266)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1274)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1266)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1372)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1364)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1372)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1364)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1470)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1462)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1470)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (1 <= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1462)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[(threadIdx.x_2*12)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2*12), 32)*4608)) + (rc.outer.outer*288)) + (floormod((threadIdx.x_2*12), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 1)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 1), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 1), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 2)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 2), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 2), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 3)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 3), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 3), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 4)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 4), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 4), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 5)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 5), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 5), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 6)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 6), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 6), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 7)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 7), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 7), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 8)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 8), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 8), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 9)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 9), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 9), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 10)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 10), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 10), 32)*9)) + (ry.outer.outer*3))]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 11)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 11), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 11), 32)*9)) + (ry.outer.outer*3))]
-              }
-            }
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[0]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[64]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[128]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[192]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[1]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[65]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[129]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[193]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[2]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[66]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[130]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[194]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[3]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[67]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[131]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[195]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[4]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[68]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[132]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[196]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[5]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[69]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[133]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[197]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[6]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[70]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[134]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[198]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[7]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[71]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[135]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[199]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[8]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[72]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[136]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[200]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[9]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[73]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[137]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[201]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[10]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[74]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[138]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[202]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[11]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[75]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[139]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[203]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[12]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[76]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[140]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[204]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[13]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[77]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[141]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[205]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[14]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[78]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[142]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[206]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[15]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[79]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[143]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[207]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[32]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[96]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[160]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[224]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[33]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[97]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[161]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[225]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[34]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[98]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[162]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[226]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[35]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[99]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[163]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[227]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[36]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[100]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[164]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[228]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[37]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[101]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[165]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[229]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[38]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[102]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[166]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[230]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[39]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[103]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[167]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[231]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[40]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[104]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[168]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[232]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[41]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[105]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[169]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[233]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[42]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[106]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[170]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[234]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[43]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[107]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[171]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[235]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[44]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[108]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[172]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[236]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[45]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[109]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[173]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[237]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[46]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[110]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[174]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[238]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[47]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[111]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[175]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[239]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[16]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[80]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[144]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[208]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[17]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[81]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[145]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[209]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[18]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[82]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[146]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[210]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[19]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[83]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[147]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[211]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[20]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[84]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[148]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[212]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[21]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[85]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[149]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[213]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[22]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[86]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[150]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[214]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[23]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[87]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[151]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[215]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[24]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[88]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[152]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[216]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[25]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[89]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[153]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[217]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[26]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[90]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[154]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[218]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[27]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[91]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[155]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[219]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[28]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[92]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[156]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[220]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[29]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[93]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[157]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[221]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[30]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[94]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[158]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[222]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[31]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[95]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[159]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[223]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[48]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[112]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[176]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[240]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[49]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[113]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[177]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[241]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[50]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[114]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[178]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[242]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[51]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[115]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[179]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[243]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[52]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[116]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[180]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[244]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[53]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[117]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[181]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[245]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[54]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[118]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[182]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[246]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[55]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[119]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[183]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[247]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[56]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[120]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[184]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[248]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[57]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[121]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[185]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[249]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[58]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[122]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[186]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[250]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[59]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[123]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[187]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[251]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[60]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[124]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[188]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[252]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[61]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[125]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[189]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[253]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[62]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[126]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[190]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[254]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[63]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[127]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[191]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[255]))
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[(threadIdx.x_1*2)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) - 7)], 0f32, dtype=float32)
-              pad_temp.shared[((threadIdx.x_1*2) + 1)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) - 7)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 98)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 91)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 98)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 91)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 196)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 189)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 196)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 189)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 294)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 287)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 294)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 287)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 392)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 385)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 392)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 385)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 490)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 483)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 490)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 483)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 588)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 581)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 588)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 581)], 0f32, dtype=float32)
+        compute_3[8] = 0f32
+        compute_3[9] = 0f32
+        compute_3[10] = 0f32
+        compute_3[11] = 0f32
+        compute_3[12] = 0f32
+        compute_3[13] = 0f32
+        for (rc.outer.outer: int32, 0, 256) {
+          attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112 {
+            if @tir.likely((threadIdx.x_1 < 41), dtype=bool) {
+              pad_temp.shared[(threadIdx.x_1*4)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1*4), 81)) && (floormod((threadIdx.x_1*4), 81) < 72)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv((threadIdx.x_1*4), 81)*49)) + (floordiv(floormod((threadIdx.x_1*4), 81), 9)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
+            }
+            if @tir.likely((threadIdx.x_1 < 41), dtype=bool) {
+              pad_temp.shared[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 1), 81)) && (floormod(((threadIdx.x_1*4) + 1), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv(((threadIdx.x_1*4) + 1), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 1), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
+            }
+            if @tir.likely((threadIdx.x_1 < 40), dtype=bool) {
+              pad_temp.shared[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 2), 81)) && (floormod(((threadIdx.x_1*4) + 2), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv(((threadIdx.x_1*4) + 2), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 2), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32)
+            }
+            if @tir.likely((threadIdx.x_1 < 40), dtype=bool) {
+              pad_temp.shared[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 3), 81)) && (floormod(((threadIdx.x_1*4) + 3), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv(((threadIdx.x_1*4) + 3), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 3), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0f32, dtype=float32)
             }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 686)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 679)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 686)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 679)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 784)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 777)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 784)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 777)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 882)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 875)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 882)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 875)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 980)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 973)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 980)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 973)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1078)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1071)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1078)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1071)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1176)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1169)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1176)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1169)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1274)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1267)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1274)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1267)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1372)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1365)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1372)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1365)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1470)] = @tir.if_then_else(((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1463)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1470)] = @tir.if_then_else(((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1463)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[(threadIdx.x_2*12)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2*12), 32)*4608)) + (rc.outer.outer*288)) + (floormod((threadIdx.x_2*12), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 1)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 1), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 1), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 2)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 2), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 2), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 3)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 3), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 3), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 4)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 4), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 4), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 5)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 5), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 5), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 6)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 6), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 6), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 7)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 7), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 7), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 8)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 8), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 8), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 9)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 9), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 9), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 10)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 10), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 10), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 11)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 11), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 11), 32)*9)) + (ry.outer.outer*3)) + 1)]
-              }
-            }
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[0]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[64]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[128]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[192]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[1]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[65]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[129]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[193]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[2]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[66]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[130]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[194]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[3]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[67]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[131]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[195]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[4]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[68]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[132]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[196]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[5]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[69]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[133]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[197]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[6]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[70]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[134]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[198]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[7]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[71]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[135]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[199]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[8]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[72]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[136]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[200]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[9]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[73]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[137]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[201]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[10]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[74]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[138]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[202]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[11]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[75]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[139]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[203]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[12]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[76]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[140]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[204]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[13]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[77]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[141]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[205]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[14]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[78]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[142]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[206]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[15]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[79]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[143]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[207]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[32]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[96]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[160]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[224]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[33]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[97]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[161]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[225]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[34]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[98]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[162]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[226]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[35]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[99]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[163]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[227]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[36]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[100]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[164]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[228]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[37]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[101]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[165]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[229]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[38]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[102]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[166]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[230]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[39]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[103]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[167]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[231]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[40]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[104]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[168]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[232]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[41]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[105]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[169]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[233]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[42]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[106]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[170]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[234]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[43]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[107]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[171]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[235]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[44]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[108]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[172]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[236]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[45]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[109]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[173]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[237]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[46]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[110]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[174]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[238]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[47]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[111]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[175]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[239]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[16]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[80]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[144]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[208]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[17]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[81]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[145]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[209]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[18]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[82]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[146]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[210]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[19]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[83]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[147]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[211]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[20]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[84]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[148]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[212]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[21]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[85]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[149]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[213]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[22]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[86]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[150]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[214]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[23]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[87]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[151]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[215]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[24]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[88]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[152]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[216]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[25]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[89]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[153]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[217]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[26]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[90]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[154]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[218]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[27]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[91]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[155]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[219]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[28]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[92]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[156]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[220]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[29]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[93]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[157]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[221]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[30]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[94]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[158]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[222]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[31]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[95]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[159]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[223]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[48]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[112]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[176]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[240]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[49]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[113]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[177]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[241]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[50]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[114]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[178]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[242]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[51]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[115]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[179]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[243]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[52]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[116]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[180]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[244]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[53]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[117]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[181]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[245]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[54]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[118]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[182]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[246]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[55]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[119]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[183]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[247]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[56]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[120]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[184]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[248]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[57]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[121]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[185]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[249]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[58]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[122]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[186]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[250]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[59]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[123]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[187]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[251]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[60]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[124]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[188]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[252]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[61]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[125]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[189]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[253]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[62]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[126]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[190]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[254]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[63]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[127]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[191]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[255]))
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[(threadIdx.x_1*2)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) - 6)], 0f32, dtype=float32)
-              pad_temp.shared[((threadIdx.x_1*2) + 1)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) - 6)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 98)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 92)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 98)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 92)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 196)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 190)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 196)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 190)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 294)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 288)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 294)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 288)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 392)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 386)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 392)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 386)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 490)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 484)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 490)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 484)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 588)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 582)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 588)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 582)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 686)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 680)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 686)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 680)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 784)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 778)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 784)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 778)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 882)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 876)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 882)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 876)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 980)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 974)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 980)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 974)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1078)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1072)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1078)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1072)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1176)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1170)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1176)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1170)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1274)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1268)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1274)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1268)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1372)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1366)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1372)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1366)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              pad_temp.shared[((threadIdx.x_1*2) + 1470)] = @tir.if_then_else((((1 <= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) && ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) < 8)) && (floormod((threadIdx.x_1*2), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1464)], 0f32, dtype=float32)
-              pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1470)] = @tir.if_then_else((((1 <= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) && ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) < 8)) && (floormod(((threadIdx.x_1*2) + 1), 7) < 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1464)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[(threadIdx.x_2*12)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2*12), 32)*4608)) + (rc.outer.outer*288)) + (floormod((threadIdx.x_2*12), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 1)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 1), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 1), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 2)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 2), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 2), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 22), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 3)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 3), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 3), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 4)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 4), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 4), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 5)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 5), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 5), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 6)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 6), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 6), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 7)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 7), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 7), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 8)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 8), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 8), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 9)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 9), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 9), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 10)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 10), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 10), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-              if @tir.likely((threadIdx.x_2 < 21), dtype=bool) {
-                kernel.shared[((threadIdx.x_2*12) + 11)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 11), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 11), 32)*9)) + (ry.outer.outer*3)) + 2)]
-              }
-            }
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[0]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[64]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[128]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[192]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[1]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[65]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[129]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[193]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[2]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[66]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[130]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[194]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[3]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[67]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[131]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[195]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[4]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[68]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[132]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[196]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[5]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[69]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[133]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[197]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[6]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[70]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[134]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[198]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[7]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[71]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[135]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[199]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[8]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[72]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[136]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[200]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[9]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[73]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[137]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[201]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[10]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[74]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[138]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[202]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[11]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[75]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[139]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[203]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[12]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[76]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[140]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[204]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[13]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[77]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[141]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[205]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[14]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[78]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[142]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[206]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[15]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[79]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[143]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[207]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[32]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[96]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[160]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[224]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[33]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[97]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[161]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[225]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[34]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[98]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[162]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[226]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[35]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[99]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[163]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[227]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[36]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[100]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[164]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[228]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[37]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[101]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[165]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[229]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[38]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[102]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[166]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[230]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[39]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[103]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[167]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[231]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[40]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[104]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[168]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[232]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[41]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[105]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[169]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[233]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[42]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[106]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[170]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[234]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[43]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[107]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[171]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[235]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[44]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[108]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[172]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[236]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[45]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[109]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[173]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[237]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[46]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[110]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[174]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[238]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[47]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[111]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[175]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[239]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[16]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[80]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[144]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[208]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[17]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[81]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[145]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[209]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[18]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[82]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[146]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[210]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[19]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[83]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[147]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[211]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[20]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[84]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[148]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[212]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[21]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[85]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[149]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[213]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[22]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[86]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[150]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[214]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[23]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[87]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[151]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[215]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[24]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[88]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[152]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[216]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[25]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[89]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[153]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[217]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[26]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[90]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[154]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[218]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[27]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[91]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[155]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[219]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[28]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[92]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[156]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[220]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[29]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[93]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[157]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[221]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[30]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[94]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[158]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[222]))
-            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[31]))
-            compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[95]))
-            compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[159]))
-            compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[223]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[48]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[112]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[176]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[240]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[49]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[113]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[177]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[241]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[50]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[114]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[178]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[242]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[51]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[115]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[179]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[243]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[52]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[116]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[180]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[244]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[53]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[117]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[181]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[245]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[54]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[118]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[182]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[246]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[55]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[119]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[183]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[247]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[56]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[120]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[184]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[248]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[57]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[121]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[185]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[249]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[58]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[122]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[186]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[250]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[59]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[123]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[187]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[251]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[60]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[124]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[188]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[252]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[61]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[125]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[189]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[253]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[62]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[126]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[190]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[254]))
-            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[63]))
-            compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[127]))
-            compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[191]))
-            compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[255]))
           }
+          attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
+          kernel.shared[threadIdx.x_2] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 18)*4608)) + (rc.outer.outer*18)) + floormod(threadIdx.x_2, 18))]
+          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
+          kernel.shared[(threadIdx.x_2 + 112)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 112), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 4), 18))]
+          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
+          kernel.shared[(threadIdx.x_2 + 224)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 224), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 8), 18))]
+          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
+          kernel.shared[(threadIdx.x_2 + 336)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 336), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 12), 18))]
+          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
+          kernel.shared[(threadIdx.x_2 + 448)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 448), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 16), 18))]
+          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 112;
+          if @tir.likely((threadIdx.x_2 < 16), dtype=bool) {
+            kernel.shared[(threadIdx.x_2 + 560)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 560), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 2), 18))]
+          }
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7)*9)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7)*9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 8)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 8)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 17)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 17)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 26)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 26)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 81)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 81)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 89)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 89)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 90)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 90)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 98)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 98)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 99)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 99)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 107)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+          compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+          compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+          compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+          compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+          compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+          compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+          compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 107)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
         }
         for (i1.inner: int32, 0, 2) {
-          compute_2[(((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x)] = max(((float32*)compute_3[i1.inner] + (float32*)bias_2[((blockIdx.x*8) + i1.inner)]), 0f32)
-          compute_2[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 98)] = max(((float32*)compute_3[(i1.inner + 2)] + (float32*)bias_2[(((blockIdx.x*8) + i1.inner) + 2)]), 0f32)
-          compute_2[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 196)] = max(((float32*)compute_3[(i1.inner + 4)] + (float32*)bias_2[(((blockIdx.x*8) + i1.inner) + 4)]), 0f32)
-          compute_2[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 294)] = max(((float32*)compute_3[(i1.inner + 6)] + (float32*)bias_2[(((blockIdx.x*8) + i1.inner) + 6)]), 0f32)
+          for (i3.inner: int32, 0, 7) {
+            compute_2[(((((blockIdx.x*1568) + (floordiv(threadIdx.x, 7)*98)) + (i1.inner*49)) + (floormod(threadIdx.x, 7)*7)) + i3.inner)] = max(((float32*)compute_3[((i1.inner*7) + i3.inner)] + (float32*)bias_2[(((blockIdx.x*32) + (floordiv(threadIdx.x, 7)*2)) + i1.inner)]), 0f32)
+          }
         }
       }
     }
@@ -1350,7 +559,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.322 ms
+    Execution time of this operator: 0.356 ms
 
 
 
@@ -1402,34 +611,34 @@ print the equivalent python schedule API, and build the binary again.
     nn_o_o_o_o, nn_o_o_o_i = s[compute].split(nn_o_o_o_i, factor=1)
     ff_o_i, ff_i = s[compute].split(ff, factor=1)
     ff_o_o_i, ff_o_i = s[compute].split(ff_o_i, factor=1)
-    ff_o_o_o_i, ff_o_o_i = s[compute].split(ff_o_o_i, factor=32)
+    ff_o_o_o_i, ff_o_o_i = s[compute].split(ff_o_o_i, factor=16)
     ff_o_o_o_o, ff_o_o_o_i = s[compute].split(ff_o_o_o_i, factor=1)
-    yy_o_i, yy_i = s[compute].split(yy, factor=7)
-    yy_o_o_i, yy_o_i = s[compute].split(yy_o_i, factor=1)
+    yy_o_i, yy_i = s[compute].split(yy, factor=1)
+    yy_o_o_i, yy_o_i = s[compute].split(yy_o_i, factor=7)
     yy_o_o_o_i, yy_o_o_i = s[compute].split(yy_o_o_i, factor=1)
     yy_o_o_o_o, yy_o_o_o_i = s[compute].split(yy_o_o_o_i, factor=1)
     xx_o_i, xx_i = s[compute].split(xx, factor=1)
     xx_o_o_i, xx_o_i = s[compute].split(xx_o_i, factor=1)
-    xx_o_o_o_i, xx_o_o_i = s[compute].split(xx_o_o_i, factor=1)
+    xx_o_o_o_i, xx_o_o_i = s[compute].split(xx_o_o_i, factor=7)
     xx_o_o_o_o, xx_o_o_o_i = s[compute].split(xx_o_o_o_i, factor=1)
-    rc_o_i, rc_i = s[compute].split(rc, factor=16)
-    rc_o_o, rc_o_i = s[compute].split(rc_o_i, factor=1)
+    rc_o_i, rc_i = s[compute].split(rc, factor=8)
+    rc_o_o, rc_o_i = s[compute].split(rc_o_i, factor=2)
     ry_o_i, ry_i = s[compute].split(ry, factor=3)
     ry_o_o, ry_o_i = s[compute].split(ry_o_i, factor=1)
-    rx_o_i, rx_i = s[compute].split(rx, factor=3)
-    rx_o_o, rx_o_i = s[compute].split(rx_o_i, factor=1)
+    rx_o_i, rx_i = s[compute].split(rx, factor=1)
+    rx_o_o, rx_o_i = s[compute].split(rx_o_i, factor=3)
     s[compute].reorder(nn_o_o_o_o, ff_o_o_o_o, yy_o_o_o_o, xx_o_o_o_o, nn_o_o_o_i, ff_o_o_o_i, yy_o_o_o_i, xx_o_o_o_i, nn_o_o_i, ff_o_o_i, yy_o_o_i, xx_o_o_i, rc_o_o, ry_o_o, rx_o_o, rc_o_i, ry_o_i, rx_o_i, nn_o_i, ff_o_i, yy_o_i, xx_o_i, rc_i, ry_i, rx_i, nn_i, ff_i, yy_i, xx_i)
     i0_o_i, i0_i = s[compute].split(i0, factor=1)
     i0_o_o_i, i0_o_i = s[compute].split(i0_o_i, factor=1)
     i0_o_o_o, i0_o_o_i = s[compute].split(i0_o_o_i, factor=1)
     i1_o_i, i1_i = s[compute].split(i1, factor=1)
-    i1_o_o_i, i1_o_i = s[compute].split(i1_o_i, factor=32)
+    i1_o_o_i, i1_o_i = s[compute].split(i1_o_i, factor=16)
     i1_o_o_o, i1_o_o_i = s[compute].split(i1_o_o_i, factor=1)
     i2_o_i, i2_i = s[compute].split(i2, factor=7)
     i2_o_o_i, i2_o_i = s[compute].split(i2_o_i, factor=1)
     i2_o_o_o, i2_o_o_i = s[compute].split(i2_o_o_i, factor=1)
     i3_o_i, i3_i = s[compute].split(i3, factor=1)
-    i3_o_o_i, i3_o_i = s[compute].split(i3_o_i, factor=1)
+    i3_o_o_i, i3_o_i = s[compute].split(i3_o_i, factor=7)
     i3_o_o_o, i3_o_o_i = s[compute].split(i3_o_o_i, factor=1)
     s[compute].reorder(i0_o_o_o, i1_o_o_o, i2_o_o_o, i3_o_o_o, i0_o_o_i, i1_o_o_i, i2_o_o_i, i3_o_o_i, i0_o_i, i1_o_i, i2_o_i, i3_o_i, i0_i, i1_i, i2_i, i3_i)
     s[compute].compute_at(s[compute], i3_o_i)
@@ -1447,14 +656,14 @@ print the equivalent python schedule API, and build the binary again.
     i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(i0_o_i, i1_o_i, i2_o_i, i3_o_i)
     s[compute].bind(i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, tvm.thread_axis("threadIdx.x"))
     ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(ax0, ax1, ax2, ax3)
-    ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
+    ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=3)
     s[kernel_shared].vectorize(ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+    ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
     s[kernel_shared].bind(ax0_ax1_fused_ax2_fused_ax3_fused_o_i, tvm.thread_axis("threadIdx.x"))
     ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(ax0, ax1, ax2, ax3)
-    ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
+    ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
     s[pad_temp_shared].vectorize(ax0_ax1_fused_ax2_fused_ax3_fused_i)
-    ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+    ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
     s[pad_temp_shared].bind(ax0_ax1_fused_ax2_fused_ax3_fused_o_i, tvm.thread_axis("threadIdx.x"))
     s[compute].pragma(nn_o_o_o_o, "auto_unroll_max_step", 64)
     s[compute].pragma(nn_o_o_o_o, "unroll_explicit", True)
@@ -1504,7 +713,7 @@ In the example below we resume the status and do more 5 trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  51.342 seconds)
+   **Total running time of the script:** ( 2 minutes  51.937 seconds)
 
 
 .. _sphx_glr_download_tutorials_auto_scheduler_tune_conv2d_layer_cuda.py:
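
Note on the split-factor changes in the tune_conv2d_layer_cuda.rst.txt hunks above: each `s[compute].split(axis, factor=f)` call yields an inner loop of extent `f` and an outer loop of extent `ceil(extent / f)`, and the next call splits the outer part again. The sketch below is plain Python, not TVM code. It traces the chain of splits on the `yy` axis before and after this commit. The extent of 7 for `yy` is an assumption inferred from the 7x7 output indexing in the printed IR, and the helper names `split` and `chain` are hypothetical.

```python
import math


def split(extent, factor):
    """Return (outer_extent, inner_extent) for a loop split by `factor`."""
    return math.ceil(extent / factor), factor


def chain(extent, factors):
    """Apply successive splits, each one to the outer part of the previous.

    `factors` is listed in call order (first split first).
    Returns the loop extents ordered from outermost to innermost.
    """
    inner_extents = []
    for f in factors:
        extent, inner = split(extent, f)
        inner_extents.append(inner)
    return [extent] + inner_extents[::-1]


# Axis order: yy_o_o_o_o, yy_o_o_o_i, yy_o_o_i, yy_o_i, yy_i
YY_EXTENT = 7  # assumed, see the note above

old_yy = chain(YY_EXTENT, [7, 1, 1, 1])  # factors on the removed (-) lines
new_yy = chain(YY_EXTENT, [1, 7, 1, 1])  # factors on the added (+) lines

print(old_yy)
print(new_yy)
```

Under that assumption, the total iteration count of `yy` stays at 7 in both versions; only the level of the split hierarchy that carries the 7 iterations moves from `yy_i` to `yy_o_i`.
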
diff --git a/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt b/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt
index bb6d6d0..f31c415 100644
--- a/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt
+++ b/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt
@@ -68,15 +68,16 @@ Create the search task
 ^^^^^^^^^^^^^^^^^^^^^^
 We then create a search task with N=L=M=128 and dtype="float32"
 If your machine supports avx instructions, you can
-- replace "llvm" below with "llvm -mcpu=core-avx2" to enable AVX2
-- replace "llvm" below with "llvm -mcpu=skylake-avx512" to enable AVX-512
+
+  - replace "llvm" below with "llvm -mcpu=core-avx2" to enable AVX2
+  - replace "llvm" below with "llvm -mcpu=skylake-avx512" to enable AVX-512
 
 
 .. code-block:: default
 
 
     target = tvm.target.Target("llvm")
-    task = auto_scheduler.create_task(matmul_add, (128, 128, 128, "float32"), target)
+    task = tvm.auto_scheduler.create_task(matmul_add, (128, 128, 128, "float32"), target)
 
     # Inspect the computational graph
     print(task.compute_dag)
@@ -102,13 +103,13 @@ If your machine supports avx instructions, you can
 
 Next, we set parameters for the auto-scheduler.
 
-* `num_measure_trials` is the number of measurement trials we can use during the search.
+* :code:`num_measure_trials` is the number of measurement trials we can use during the search.
   We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
   good value for the search to converge. You can do more trials according to your time budget.
-* In addition, we use `RecordToFile` to dump measurement records into a file `matmul.json`.
+* In addition, we use :code:`RecordToFile` to dump measurement records into a file `matmul.json`.
   The measurement records can be used to query the history best, resume the search,
   and do more analyses later.
-* see :any:`auto_scheduler.auto_schedule.TuningOptions`: for more parameters
+* see :any:`auto_scheduler.TuningOptions` for more parameters
 
 
 .. code-block:: default
@@ -146,7 +147,7 @@ After some measurement trials, it will return the best schedule it found.
 
  .. code-block:: none
 
-    *T*T*T*T*T*T*T*T*T*T
+    *T*T*T*T*T*T*T*T*T
 
 
 
@@ -173,23 +174,345 @@ parallelization, vectorization, unrolling and operator fusion.
     primfn(A_1: handle, B_1: handle, C_1: handle, out_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
       buffers = {out: Buffer(out_2: Pointer(float32), float32, [128, 128], []),
-                 B: Buffer(B_2: Pointer(float32), float32, [128, 128], []),
                  C: Buffer(C_2: Pointer(float32), float32, [128, 128], []),
+                 B: Buffer(B_2: Pointer(float32), float32, [128, 128], []),
                  A: Buffer(A_2: Pointer(float32), float32, [128, 128], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C, out_1: out} {
       attr [matmul: Pointer(float32)] "storage_scope" = "global";
       allocate(matmul, float32, [16384]) {
-        for (i: int32, 0, 128) {
-          for (j: int32, 0, 128) {
-            matmul[((i*128) + j)] = 0f32
-            for (k: int32, 0, 128) {
-              matmul[((i*128) + j)] = ((float32*)matmul[((i*128) + j)] + ((float32*)A_2[((i*128) + k)]*(float32*)B_2[((k*128) + j)]))
+        for (i.outer.outer.inner: int32, 0, 8) {
+          for (j.outer.outer.inner: int32, 0, 4) {
+            matmul[ramp(((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 128), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 256), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 384), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 512), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 640), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 768), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 896), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 2), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 130), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 258), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 386), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 514), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 642), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 770), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 898), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 4), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 132), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 260), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 388), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 516), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 644), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 772), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 900), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 6), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 134), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 262), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 390), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 518), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 646), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 774), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 902), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 8), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 136), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 264), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 392), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 520), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 648), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 776), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 904), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 10), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 138), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 266), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 394), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 522), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 650), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 778), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 906), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 12), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 140), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 268), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 396), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 524), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 652), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 780), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 908), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 14), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 142), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 270), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 398), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 526), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 654), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 782), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 910), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 16), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 144), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 272), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 400), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 528), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 656), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 784), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 912), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 18), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 146), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 274), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 402), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 530), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 658), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 786), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 914), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 20), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 148), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 276), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 404), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 532), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 660), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 788), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 916), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 22), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 150), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 278), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 406), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 534), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 662), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 790), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 918), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 24), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 152), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 280), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 408), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 536), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 664), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 792), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 920), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 26), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 154), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 282), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 410), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 538), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 666), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 794), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 922), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 28), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 156), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 284), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 412), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 540), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 668), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 796), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 924), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 30), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 158), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 286), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 414), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 542), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 670), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 798), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 926), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1024), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1152), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1280), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1408), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1536), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1664), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1792), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1920), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1026), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1154), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1282), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1410), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1538), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1666), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1794), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1922), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1028), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1156), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1284), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1412), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1540), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1668), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1796), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1924), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1030), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1158), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1286), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1414), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1542), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1670), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1798), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1926), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1032), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1160), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1288), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1416), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1544), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1672), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1800), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1928), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1034), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1162), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1290), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1418), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1546), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1674), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1802), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1930), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1036), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1164), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1292), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1420), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1548), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1676), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1804), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1932), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1038), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1166), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1294), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1422), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1550), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1678), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1806), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1934), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1040), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1168), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1296), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1424), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1552), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1680), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1808), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1936), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1042), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1170), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1298), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1426), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1554), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1682), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1810), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1938), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1044), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1172), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1300), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1428), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1556), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1684), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1812), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1940), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1046), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1174), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1302), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1430), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1558), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1686), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1814), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1942), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1048), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1176), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1304), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1432), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1560), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1688), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1816), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1944), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1050), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1178), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1306), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1434), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1562), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1690), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1818), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1946), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1052), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1180), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1308), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1436), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1564), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1692), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1820), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1948), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1054), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1182), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1310), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1438), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1566), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1694), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1822), 1, 2)] = broadcast(0f32, 2)
+            matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1950), 1, 2)] = broadcast(0f32, 2)
+            for (k.outer: int32, 0, 16) {
+              for (i.outer.inner: int32, 0, 2) {
+                for (j.outer.inner: int32, 0, 16) {
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[(((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8))], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 128)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 256)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 384)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 512)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 640)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 768)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 896)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 1)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 129)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 257)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 385)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 513)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 641)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 769)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 897)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 2)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 130)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 258)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 386)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 514)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 642)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 770)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 898)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 3)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 131)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 259)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 387)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 515)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 643)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 771)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 899)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 4)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 132)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 260)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 388)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 516)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 644)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 772)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 900)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 5)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 133)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 261)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 389)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 517)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 645)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 773)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 901)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 6)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 134)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 262)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 390)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 518)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 646)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 774)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 902)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 7)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)]))
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 135)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 263)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 391)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 519)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 647)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 775)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                  matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 903)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + [...]
+                }
+              }
             }
           }
         }
-        for (i_1: int32, 0, 128) {
-          for (j_1: int32, 0, 128) {
-            out_2[((i_1*128) + j_1)] = ((float32*)matmul[((i_1*128) + j_1)] + (float32*)C_2[((i_1*128) + j_1)])
+        for (i.inner: int32, 0, 128) {
+          for (j.inner: int32, 0, 128) {
+            out_2[((i.inner*128) + j.inner)] = ((float32*)matmul[((i.inner*128) + j.inner)] + (float32*)C_2[((i.inner*128) + j.inner)])
           }
         }
       }
@@ -241,7 +564,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 2.217 ms
+    Execution time of this operator: 0.371 ms
 
 
 
@@ -352,13 +675,13 @@ In the example below we resume the status and do more 5 trials.
   For example, you can start a new thread/process (with the built-in Python library
   threading or multiprocessing) and run the TVM binaries in the new thread/process.
   This provides isolation and avoids conflicts in the main thread/process.
-  You can also use :any:`auto_scheduler.measure.LocalRPCMeasureContext` for auto-scheduler,
+  You can also use :any:`auto_scheduler.LocalRPCMeasureContext` for auto-scheduler,
   as shown in the GPU tutorial (:ref:`auto-scheduler-conv-gpu`).
 
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  56.117 seconds)
+   **Total running time of the script:** ( 1 minutes  46.419 seconds)
 
 
 .. _sphx_glr_download_tutorials_auto_scheduler_tune_matmul_x86.py:
diff --git a/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt
index 11d0442..36d6afa 100644
--- a/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,11 +5,11 @@
 
 Computation times
 =================
-**01:17.575** total execution time for **tutorials_autotvm** files:
-
-- **00:49.825**: :ref:`sphx_glr_tutorials_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
-- **00:27.115**: :ref:`sphx_glr_tutorials_autotvm_tune_simple_template.py` (``tune_simple_template.py``)
-- **00:00.184**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
-- **00:00.159**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
-- **00:00.147**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
-- **00:00.144**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
+**01:10.287** total execution time for **tutorials_autotvm** files:
+
+- **00:45.110**: :ref:`sphx_glr_tutorials_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
+- **00:24.539**: :ref:`sphx_glr_tutorials_autotvm_tune_simple_template.py` (``tune_simple_template.py``)
+- **00:00.186**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
+- **00:00.157**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
+- **00:00.148**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
+- **00:00.146**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
diff --git a/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt
index 0e4f44b..676a57d 100644
--- a/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt
@@ -238,26 +238,26 @@ for this template
        7 unroll_explicit: OtherOption([0, 1]) len=2
     )
     Get devices for measurement successfully!
-    No: 1   GFLOPS: 27.74/27.74     result: MeasureResult(costs=(0.00834437025,), error_no=0, all_cost=2.8114962577819824, timestamp=1600569712.3554745)    [('tile_f', [-1, 32, 1, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7166780
-    No: 2   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 3   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 4   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 5   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 6   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 7   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 8   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 9   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 10  GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 11  GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 12  GFLOPS: 46.00/46.00     result: MeasureResult(costs=(0.005032405272727272,), error_no=0, all_cost=2.980912923812866, timestamp=1600569724.2990882)      [('tile_f', [-1, 2, 8, 2]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2077980
-    No: 13  GFLOPS: 0.00/46.00      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 14  GFLOPS: 73.67/73.67     result: MeasureResult(costs=(0.00314226628125,), error_no=0, all_cost=1.9068870544433594, timestamp=1600569726.168562)  [('tile_f', [-1, 2, 16, 8]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,8726459
-    No: 15  GFLOPS: 27.61/73.67     result: MeasureResult(costs=(0.008385226583333334,), error_no=0, all_cost=1.822760820388794, timestamp=1600569727.4046636)      [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5905444
-    No: 16  GFLOPS: 1.61/73.67      result: MeasureResult(costs=(0.14341517075,), error_no=0, all_cost=4.74316668510437, timestamp=1600569730.2062669)      [('tile_f', [-1, 2, 8, 8]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 2, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7428895
-    No: 17  GFLOPS: 0.00/73.67      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 18  GFLOPS: 0.00/73.67      result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d5e42) [0x7f3a69a31e42]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f3a69a3343b]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
-    No: 19  GFLOPS: 23.75/73.67     result: MeasureResult(costs=(0.00974691818181818,), error_no=0, all_cost=1.496830701828003, timestamp=1600569739.6011925)       [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,782066
-    No: 20  GFLOPS: 0.00/73.67      result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d5e42) [0x7f3a69a31e42]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f3a69a3343b]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
+    No: 1   GFLOPS: 27.65/27.65     result: MeasureResult(costs=(0.008373972333333334,), error_no=0, all_cost=2.972865104675293, timestamp=1600758792.8104715)      [('tile_f', [-1, 32, 1, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7166780
+    No: 2   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 3   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 4   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 5   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 6   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 7   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 8   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 9   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 10  GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 11  GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 12  GFLOPS: 49.34/49.34     result: MeasureResult(costs=(0.0046919395,), error_no=0, all_cost=2.9751179218292236, timestamp=1600758802.322572)      [('tile_f', [-1, 2, 8, 2]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2077980
+    No: 13  GFLOPS: 0.00/49.34      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 14  GFLOPS: 73.63/73.63     result: MeasureResult(costs=(0.00314397021875,), error_no=0, all_cost=2.0785014629364014, timestamp=1600758803.7681682) [('tile_f', [-1, 2, 16, 8]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,8726459
+    No: 15  GFLOPS: 27.59/73.63     result: MeasureResult(costs=(0.008390514333333333,), error_no=0, all_cost=1.9435279369354248, timestamp=1600758804.8366861)     [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5905444
+    No: 16  GFLOPS: 1.61/73.63      result: MeasureResult(costs=(0.14342895375,), error_no=0, all_cost=4.801366806030273, timestamp=1600758807.475919)      [('tile_f', [-1, 2, 8, 8]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 2, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7428895
+    No: 17  GFLOPS: 0.00/73.63      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 18  GFLOPS: 0.00/73.63      result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d6002) [0x7f233404c002]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f233404d5fb]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
+    No: 19  GFLOPS: 23.75/73.63     result: MeasureResult(costs=(0.009745905545454545,), error_no=0, all_cost=1.7060630321502686, timestamp=1600758815.8534865)     [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,782066
+    No: 20  GFLOPS: 0.00/73.63      result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d6002) [0x7f233404c002]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f233404d5fb]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
 
 
 
@@ -310,7 +310,7 @@ and measure running time.
 
     Best config:
     [('tile_f', [-1, 2, 16, 8]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,8726459
-    Time cost of this operator: 0.003474
+    Time cost of this operator: 0.003496
 
 
 
diff --git a/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt b/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt
index 2ad2306..9752376 100644
--- a/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt
@@ -365,16 +365,16 @@ used to get the best config later.
  .. code-block:: none
 
     Get devices for measurement successfully!
-    No: 1   GFLOPS: 9.96/9.96       result: MeasureResult(costs=(0.0269546244,), error_no=0, all_cost=0.927971601486206, timestamp=1600569684.2742639)      [('tile_y', [-1, 8]), ('tile_x', [-1, 32])],None,53
-    No: 2   GFLOPS: 12.65/12.65     result: MeasureResult(costs=(0.0212222168,), error_no=0, all_cost=1.3642914295196533, timestamp=1600569685.455414)      [('tile_y', [-1, 128]), ('tile_x', [-1, 256])],None,87
-    No: 3   GFLOPS: 15.22/15.22     result: MeasureResult(costs=(0.017637051,), error_no=0, all_cost=1.055131435394287, timestamp=1600569686.5537252)       [('tile_y', [-1, 8]), ('tile_x', [-1, 512])],None,93
-    No: 4   GFLOPS: 13.09/15.22     result: MeasureResult(costs=(0.0205073562,), error_no=0, all_cost=1.1743948459625244, timestamp=1600569687.7479143)     [('tile_y', [-1, 128]), ('tile_x', [-1, 512])],None,97
-    No: 5   GFLOPS: 2.02/15.22      result: MeasureResult(costs=(0.1327765914,), error_no=0, all_cost=2.800631523132324, timestamp=1600569690.720167)       [('tile_y', [-1, 256]), ('tile_x', [-1, 4])],None,28
-    No: 6   GFLOPS: 8.86/15.22      result: MeasureResult(costs=(0.030290215199999998,), error_no=0, all_cost=1.296706199645996, timestamp=1600569692.0395942)      [('tile_y', [-1, 4]), ('tile_x', [-1, 32])],None,52
-    No: 7   GFLOPS: 13.83/15.22     result: MeasureResult(costs=(0.0194060902,), error_no=0, all_cost=0.917316198348999, timestamp=1600569693.2027879)      [('tile_y', [-1, 2]), ('tile_x', [-1, 512])],None,91
-    No: 8   GFLOPS: 11.73/15.22     result: MeasureResult(costs=(0.022882502800000003,), error_no=0, all_cost=1.1356632709503174, timestamp=1600569694.3774312)     [('tile_y', [-1, 2]), ('tile_x', [-1, 256])],None,81
-    No: 9   GFLOPS: 0.92/15.22      result: MeasureResult(costs=(0.291634,), error_no=0, all_cost=5.390047550201416, timestamp=1600569701.2650788)  [('tile_y', [-1, 128]), ('tile_x', [-1, 2])],None,17
-    No: 10  GFLOPS: 1.19/15.22      result: MeasureResult(costs=(0.22557004639999997,), error_no=0, all_cost=4.4250712394714355, timestamp=1600569705.7239919)      [('tile_y', [-1, 1]), ('tile_x', [-1, 2])],None,10
+    No: 1   GFLOPS: 9.77/9.77       result: MeasureResult(costs=(0.0274771002,), error_no=0, all_cost=1.644925594329834, timestamp=1600758767.2463927)      [('tile_y', [-1, 8]), ('tile_x', [-1, 32])],None,53
+    No: 2   GFLOPS: 12.61/12.61     result: MeasureResult(costs=(0.021286801799999998,), error_no=0, all_cost=0.8907041549682617, timestamp=1600758768.201701)      [('tile_y', [-1, 128]), ('tile_x', [-1, 256])],None,87
+    No: 3   GFLOPS: 15.62/15.62     result: MeasureResult(costs=(0.0171844216,), error_no=0, all_cost=1.0766642093658447, timestamp=1600758769.0675318)     [('tile_y', [-1, 8]), ('tile_x', [-1, 512])],None,93
+    No: 4   GFLOPS: 13.08/15.62     result: MeasureResult(costs=(0.0205201322,), error_no=0, all_cost=1.000988483428955, timestamp=1600758769.9721282)      [('tile_y', [-1, 128]), ('tile_x', [-1, 512])],None,97
+    No: 5   GFLOPS: 1.99/15.62      result: MeasureResult(costs=(0.13515661699999998,), error_no=0, all_cost=3.0966920852661133, timestamp=1600758772.747954)       [('tile_y', [-1, 256]), ('tile_x', [-1, 4])],None,28
+    No: 6   GFLOPS: 8.94/15.62      result: MeasureResult(costs=(0.030038061999999997,), error_no=0, all_cost=1.2703056335449219, timestamp=1600758773.7996676)     [('tile_y', [-1, 4]), ('tile_x', [-1, 32])],None,52
+    No: 7   GFLOPS: 13.82/15.62     result: MeasureResult(costs=(0.0194217882,), error_no=0, all_cost=1.1055903434753418, timestamp=1600758774.7236273)     [('tile_y', [-1, 2]), ('tile_x', [-1, 512])],None,91
+    No: 8   GFLOPS: 12.04/15.62     result: MeasureResult(costs=(0.022300372800000003,), error_no=0, all_cost=1.1945393085479736, timestamp=1600758775.6967962)     [('tile_y', [-1, 2]), ('tile_x', [-1, 256])],None,81
+    No: 9   GFLOPS: 0.92/15.62      result: MeasureResult(costs=(0.2910877202,), error_no=0, all_cost=5.471558094024658, timestamp=1600758782.301078)       [('tile_y', [-1, 128]), ('tile_x', [-1, 2])],None,17
+    No: 10  GFLOPS: 1.21/15.62      result: MeasureResult(costs=(0.221818162,), error_no=0, all_cost=4.466007947921753, timestamp=1600758786.428074)        [('tile_y', [-1, 1]), ('tile_x', [-1, 2])],None,10
 
 
 
diff --git a/docs/_sources/tutorials/dev/sg_execution_times.rst.txt b/docs/_sources/tutorials/dev/sg_execution_times.rst.txt
index e48f228..11eef20 100644
--- a/docs/_sources/tutorials/dev/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/dev/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**00:00.556** total execution time for **tutorials_dev** files:
+**00:00.566** total execution time for **tutorials_dev** files:
 
-- **00:00.371**: :ref:`sphx_glr_tutorials_dev_use_pass_infra.py` (``use_pass_infra.py``)
-- **00:00.186**: :ref:`sphx_glr_tutorials_dev_low_level_custom_pass.py` (``low_level_custom_pass.py``)
+- **00:00.376**: :ref:`sphx_glr_tutorials_dev_use_pass_infra.py` (``use_pass_infra.py``)
+- **00:00.189**: :ref:`sphx_glr_tutorials_dev_low_level_custom_pass.py` (``low_level_custom_pass.py``)
diff --git a/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt b/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt
index 77b7acd..c5a3836 100644
--- a/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt
@@ -421,7 +421,7 @@ Execute on TVM
 
     TVM prediction top-1: tiger cat
     Evaluate inference time cost...
-    Mean inference time (std dev): 13.37 ms (0.12 ms)
+    Mean inference time (std dev): 14.01 ms (1.59 ms)
 
 
 
diff --git a/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt b/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt
index f1b2e16..62130a9 100644
--- a/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt
@@ -237,7 +237,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  47.919 seconds)
+   **Total running time of the script:** ( 1 minutes  48.525 seconds)
 
 
 .. _sphx_glr_download_tutorials_frontend_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt b/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt
index 22d63ee..3fcd84b 100644
--- a/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt
@@ -348,7 +348,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
  .. code-block:: none
 
-    Elapsed average ms: 19.295815570000002
+    Elapsed average ms: 19.28641578
 
 
 
diff --git a/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt b/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt
index 4416257..b2c80c4 100644
--- a/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt
@@ -368,7 +368,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
  .. code-block:: none
 
-    Elapsed average ms: 36.139228329999995
+    Elapsed average ms: 36.101913149999994
 
 
 
@@ -401,7 +401,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  38.399 seconds)
+   **Total running time of the script:** ( 2 minutes  37.247 seconds)
 
 
 .. _sphx_glr_download_tutorials_frontend_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt b/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt
index 7903ba3..94be0f9 100644
--- a/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt
@@ -319,7 +319,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  54.180 seconds)
+   **Total running time of the script:** ( 1 minutes  55.271 seconds)
 
 
 .. _sphx_glr_download_tutorials_frontend_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/tutorials/frontend/from_onnx.rst.txt b/docs/_sources/tutorials/frontend/from_onnx.rst.txt
index 970d72e..535a85a 100644
--- a/docs/_sources/tutorials/frontend/from_onnx.rst.txt
+++ b/docs/_sources/tutorials/frontend/from_onnx.rst.txt
@@ -156,7 +156,7 @@ Execute on TVM
 
  .. code-block:: none
 
-
    ...47%, 0.01 MB, 38 KB/s, 0 seconds passed
    ...94%, 0.02 MB, 74 KB/s, 0 seconds passed
    ...100%, 0.02 MB, 111 KB/s, 0 seconds passed
+
    ...47%, 0.01 MB, 214 KB/s, 0 seconds passed
    ...94%, 0.02 MB, 361 KB/s, 0 seconds passed
    ...100%, 0.02 MB, 539 KB/s, 0 seconds passed
     Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 224, 224), 'float32'), ('TENSOR', (9, 32, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
     Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (32, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
     Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1, 224, 224), 'float32'), ('TENSOR', (64, 1, 5, 5), 'float32'), (1, 1), (2, 2, 2, 2), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
diff --git a/docs/_sources/tutorials/frontend/sg_execution_times.rst.txt b/docs/_sources/tutorials/frontend/sg_execution_times.rst.txt
index e21d80b..60fcb09 100644
--- a/docs/_sources/tutorials/frontend/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/frontend/sg_execution_times.rst.txt
@@ -5,24 +5,24 @@
 
 Computation times
 =================
-**10:35.452** total execution time for **tutorials_frontend** files:
+**10:36.775** total execution time for **tutorials_frontend** files:
 
-- **02:38.399**: :ref:`sphx_glr_tutorials_frontend_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)
-- **01:54.180**: :ref:`sphx_glr_tutorials_frontend_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)
-- **01:47.919**: :ref:`sphx_glr_tutorials_frontend_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``)
-- **00:38.947**: :ref:`sphx_glr_tutorials_frontend_deploy_prequantized.py` (``deploy_prequantized.py``)
-- **00:37.533**: :ref:`sphx_glr_tutorials_frontend_from_tensorflow.py` (``from_tensorflow.py``)
-- **00:31.184**: :ref:`sphx_glr_tutorials_frontend_deploy_quantized.py` (``deploy_quantized.py``)
-- **00:26.233**: :ref:`sphx_glr_tutorials_frontend_from_tflite.py` (``from_tflite.py``)
-- **00:22.877**: :ref:`sphx_glr_tutorials_frontend_from_darknet.py` (``from_darknet.py``)
-- **00:16.766**: :ref:`sphx_glr_tutorials_frontend_from_caffe2.py` (``from_caffe2.py``)
-- **00:15.025**: :ref:`sphx_glr_tutorials_frontend_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)
-- **00:14.010**: :ref:`sphx_glr_tutorials_frontend_deploy_model_on_android.py` (``deploy_model_on_android.py``)
-- **00:11.676**: :ref:`sphx_glr_tutorials_frontend_from_keras.py` (``from_keras.py``)
-- **00:11.624**: :ref:`sphx_glr_tutorials_frontend_from_pytorch.py` (``from_pytorch.py``)
-- **00:09.566**: :ref:`sphx_glr_tutorials_frontend_from_coreml.py` (``from_coreml.py``)
-- **00:08.809**: :ref:`sphx_glr_tutorials_frontend_from_mxnet.py` (``from_mxnet.py``)
-- **00:05.507**: :ref:`sphx_glr_tutorials_frontend_build_gcn.py` (``build_gcn.py``)
-- **00:03.012**: :ref:`sphx_glr_tutorials_frontend_using_external_lib.py` (``using_external_lib.py``)
-- **00:02.012**: :ref:`sphx_glr_tutorials_frontend_from_onnx.py` (``from_onnx.py``)
-- **00:00.173**: :ref:`sphx_glr_tutorials_frontend_deploy_sparse.py` (``deploy_sparse.py``)
+- **02:37.247**: :ref:`sphx_glr_tutorials_frontend_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)
+- **01:55.271**: :ref:`sphx_glr_tutorials_frontend_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)
+- **01:48.525**: :ref:`sphx_glr_tutorials_frontend_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``)
+- **00:39.296**: :ref:`sphx_glr_tutorials_frontend_deploy_prequantized.py` (``deploy_prequantized.py``)
+- **00:37.281**: :ref:`sphx_glr_tutorials_frontend_from_tensorflow.py` (``from_tensorflow.py``)
+- **00:31.319**: :ref:`sphx_glr_tutorials_frontend_deploy_quantized.py` (``deploy_quantized.py``)
+- **00:26.102**: :ref:`sphx_glr_tutorials_frontend_from_tflite.py` (``from_tflite.py``)
+- **00:22.942**: :ref:`sphx_glr_tutorials_frontend_from_darknet.py` (``from_darknet.py``)
+- **00:16.702**: :ref:`sphx_glr_tutorials_frontend_from_caffe2.py` (``from_caffe2.py``)
+- **00:15.132**: :ref:`sphx_glr_tutorials_frontend_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)
+- **00:14.016**: :ref:`sphx_glr_tutorials_frontend_deploy_model_on_android.py` (``deploy_model_on_android.py``)
+- **00:11.667**: :ref:`sphx_glr_tutorials_frontend_from_pytorch.py` (``from_pytorch.py``)
+- **00:11.593**: :ref:`sphx_glr_tutorials_frontend_from_keras.py` (``from_keras.py``)
+- **00:09.698**: :ref:`sphx_glr_tutorials_frontend_from_mxnet.py` (``from_mxnet.py``)
+- **00:09.579**: :ref:`sphx_glr_tutorials_frontend_from_coreml.py` (``from_coreml.py``)
+- **00:05.577**: :ref:`sphx_glr_tutorials_frontend_build_gcn.py` (``build_gcn.py``)
+- **00:02.951**: :ref:`sphx_glr_tutorials_frontend_using_external_lib.py` (``using_external_lib.py``)
+- **00:01.709**: :ref:`sphx_glr_tutorials_frontend_from_onnx.py` (``from_onnx.py``)
+- **00:00.168**: :ref:`sphx_glr_tutorials_frontend_deploy_sparse.py` (``deploy_sparse.py``)
diff --git a/docs/_sources/tutorials/get_started/cross_compilation_and_rpc.rst.txt b/docs/_sources/tutorials/get_started/cross_compilation_and_rpc.rst.txt
index 8d8f6eb..7718e38 100644
--- a/docs/_sources/tutorials/get_started/cross_compilation_and_rpc.rst.txt
+++ b/docs/_sources/tutorials/get_started/cross_compilation_and_rpc.rst.txt
@@ -235,7 +235,7 @@ device and returns the measured cost. Network overhead is excluded.
 
  .. code-block:: none
 
-    1.179e-07 secs/op
+    1.186e-07 secs/op
 
 
 
diff --git a/docs/_sources/tutorials/get_started/relay_quick_start.rst.txt b/docs/_sources/tutorials/get_started/relay_quick_start.rst.txt
index cd28ece..1b735ed 100644
--- a/docs/_sources/tutorials/get_started/relay_quick_start.rst.txt
+++ b/docs/_sources/tutorials/get_started/relay_quick_start.rst.txt
@@ -224,7 +224,7 @@ in this example. Then the machine code will be generated as the module library.
 
  .. code-block:: none
 
-
    ...1%, 0.01 MB, 37 KB/s, 0 seconds passed
    ...3%, 0.02 MB, 72 KB/s, 0 seconds passed
    ...5%, 0.02 MB, 108 KB/s, 0 seconds passed
    ...7%, 0.03 MB, 143 KB/s, 0 seconds passed
    ...9%, 0.04 MB, 178 KB/s, 0 seconds passed
    ...11%, 0.05 MB, 208 KB/s, 0 seconds passed
    ...13%, 0.05 MB, 241 KB/s, 0 seconds passed
    ...15%, 0.06 MB, 275 KB/s, 0 seconds passed
    ...17%, 0.07 MB, 308 KB/s, 0 seconds passed
    ...19%, 0.08 MB, 341 KB/s, 0 seconds passed
    ...21%, 0.09 MB, 374 KB/s, 0 seconds passed
    ...23%, 0.09 MB, 400 KB/s, 0 seconds passed
    ...25%, 0.10 MB, 434 KB/s, 0 seconds passed
    ...27%, 0.11 MB, 467 KB/s, 0 seconds passed
    ...29%, 0.12 MB, 499 KB/s, 0 seconds passed
    ...31%, 0.12 MB, 530 KB/s, 0 seconds passed
    ...33%, 0.13 MB, 563 KB/s, 0 seconds passed
    ...35%, 0.14 MB, 594 KB/s, 0 seconds passed
    ...37%, 0.15 MB, 627 KB/s, 0 seconds passed
    ...39%, 0.16 MB, 657 KB/s, 0 seconds passed
   ...41%, 0.16 MB, 690 KB/s, 0 seconds passed
    ...43%, 0.17 MB, 720 KB/s, 0 seconds passed
    ...45%, 0.18 MB, 751 KB/s, 0 seconds passed
    ...47%, 0.19 MB, 783 KB/s, 0 seconds passed
    ...49%, 0.20 MB, 805 KB/s, 0 seconds passed
    ...51%, 0.20 MB, 835 KB/s, 0 seconds passed
    ...53%, 0.21 MB, 862 KB/s, 0 seconds passed
    ...55%, 0.22 MB, 894 KB/s, 0 seconds passed
    ...57%, 0.23 MB, 922 KB/s, 0 seconds passed
    ...59%, 0.23 MB, 954 KB/s, 0 seconds passed
    ...61%, 0.24 MB, 979 KB/s, 0 seconds passed
    ...63%, 0.25 MB, 1011 KB/s, 0 seconds passed
    ...65%, 0.26 MB, 1039 KB/s, 0 seconds passed
    ...67%, 0.27 MB, 1070 KB/s, 0 seconds passed
    ...69%, 0.27 MB, 1097 KB/s, 0 seconds passed
    ...71%, 0.28 MB, 1128 KB/s, 0 seconds passed
    ...73%, 0.29 MB, 1150 KB/s, 0 seconds passed
    ...75%, 0.30 MB, 1180 KB/s, 0 seconds passed
    ...77%, 0.30 MB, 1205 KB/s, 0 seconds passed
    ...79%, 0.31 MB, 1235 KB/s, 0 seconds passed
    ...81%, 0.32 MB, 1262 KB/s, 0 seconds passed
   ...83%, 0.33 MB, 1293 KB/s, 0 seconds passed
    ...85%, 0.34 MB, 1317 KB/s, 0 seconds passed
    ...87%, 0.34 MB, 1347 KB/s, 0 seconds passed
    ...89%, 0.35 MB, 1373 KB/s, 0 seconds passed
    ...91%, 0.36 MB, 1403 KB/s, 0 seconds passed
    ...93%, 0.37 MB, 1426 KB/s, 0 seconds passed
    ...95%, 0.38 MB, 1456 KB/s, 0 seconds passed
    ...97%, 0.38 MB, 1481 KB/s, 0 seconds passed
    ...99%, 0.39 MB, 1511 KB/s, 0 seconds passed
    ...100%, 0.40 MB, 1540 KB/s, 0 seconds passed
+
    ...1%, 0.01 MB, 180 KB/s, 0 seconds passed
    ...3%, 0.02 MB, 303 KB/s, 0 seconds passed
    ...5%, 0.02 MB, 454 KB/s, 0 seconds passed
    ...7%, 0.03 MB, 600 KB/s, 0 seconds passed
    ...9%, 0.04 MB, 728 KB/s, 0 seconds passed
    ...11%, 0.05 MB, 775 KB/s, 0 seconds passed
    ...13%, 0.05 MB, 892 KB/s, 0 seconds passed
    ...15%, 0.06 MB, 1006 KB/s, 0 seconds passed
    ...17%, 0.07 MB, 1119 KB/s, 0 seconds passed
    ...19%, 0.08 MB, 1241 KB/s, 0 seconds passed
    ...21%, 0.09 MB, 1349 KB/s, 0 seconds passed
    ...23%, 0.09 MB, 1365 KB/s, 0 seconds passed
    ...25%, 0.10 MB, 1476 KB/s, 0 seconds passed
    ...27%, 0.11 MB, 1582 KB/s, 0 seconds passed
    ...29%, 0.12 MB, 1676 KB/s, 0 seconds passed
    ...31%, 0.12 MB, 1785 KB/s, 0 seconds passed
    ...33%, 0.13 MB, 1892 KB/s, 0 seconds passed
    ...35%, 0.14 MB, 1986 KB/s, 0 seconds passed
    ...37%, 0.15 MB, 2077 KB/s, 0 seconds passed
    ...39%, 0.16 MB, 2183 KB/s, 0 seconds passed
   ...41%, 0.16 MB, 2265 KB/s, 0 seconds passed
    ...43%, 0.17 MB, 2369 KB/s, 0 seconds passed
    ...45%, 0.18 MB, 2474 KB/s, 0 seconds passed
    ...47%, 0.19 MB, 2559 KB/s, 0 seconds passed
    ...49%, 0.20 MB, 2661 KB/s, 0 seconds passed
    ...51%, 0.20 MB, 2628 KB/s, 0 seconds passed
    ...53%, 0.21 MB, 2713 KB/s, 0 seconds passed
    ...55%, 0.22 MB, 2811 KB/s, 0 seconds passed
    ...57%, 0.23 MB, 2882 KB/s, 0 seconds passed
    ...59%, 0.23 MB, 2978 KB/s, 0 seconds passed
    ...61%, 0.24 MB, 3075 KB/s, 0 seconds passed
    ...63%, 0.25 MB, 3172 KB/s, 0 seconds passed
    ...65%, 0.26 MB, 3244 KB/s, 0 seconds passed
    ...67%, 0.27 MB, 3339 KB/s, 0 seconds passed
    ...69%, 0.27 MB, 3433 KB/s, 0 seconds passed
    ...71%, 0.28 MB, 3529 KB/s, 0 seconds passed
    ...73%, 0.29 MB, 3600 KB/s, 0 seconds passed
    ...75%, 0.30 MB, 3693 KB/s, 0 seconds passed
    ...77%, 0.30 MB, 3786 KB/s, 0 seconds passed
    ...79%, 0.31 MB, 3880 KB/s, 0 seconds passed
   ...81%, 0.32 MB, 3950 KB/s, 0 seconds passed
    ...83%, 0.33 MB, 4042 KB/s, 0 seconds passed
    ...85%, 0.34 MB, 4134 KB/s, 0 seconds passed
    ...87%, 0.34 MB, 4227 KB/s, 0 seconds passed
    ...89%, 0.35 MB, 4286 KB/s, 0 seconds passed
    ...91%, 0.36 MB, 4378 KB/s, 0 seconds passed
    ...93%, 0.37 MB, 4468 KB/s, 0 seconds passed
    ...95%, 0.38 MB, 4559 KB/s, 0 seconds passed
    ...97%, 0.38 MB, 4624 KB/s, 0 seconds passed
    ...99%, 0.39 MB, 4713 KB/s, 0 seconds passed
    ...100%, 0.40 MB, 4798 KB/s, 0 seconds passed
     Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('dense_small_batch.cuda', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
 
 
diff --git a/docs/_sources/tutorials/get_started/sg_execution_times.rst.txt b/docs/_sources/tutorials/get_started/sg_execution_times.rst.txt
index 17bf649..74fdf7f 100644
--- a/docs/_sources/tutorials/get_started/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/get_started/sg_execution_times.rst.txt
@@ -5,8 +5,8 @@
 
 Computation times
 =================
-**00:16.580** total execution time for **tutorials_get_started** files:
+**00:16.361** total execution time for **tutorials_get_started** files:
 
-- **00:16.096**: :ref:`sphx_glr_tutorials_get_started_relay_quick_start.py` (``relay_quick_start.py``)
+- **00:15.875**: :ref:`sphx_glr_tutorials_get_started_relay_quick_start.py` (``relay_quick_start.py``)
 - **00:00.350**: :ref:`sphx_glr_tutorials_get_started_tensor_expr_get_started.py` (``tensor_expr_get_started.py``)
-- **00:00.134**: :ref:`sphx_glr_tutorials_get_started_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``)
+- **00:00.136**: :ref:`sphx_glr_tutorials_get_started_cross_compilation_and_rpc.py` (``cross_compilation_and_rpc.py``)
diff --git a/docs/_sources/tutorials/language/schedule_primitives.rst.txt b/docs/_sources/tutorials/language/schedule_primitives.rst.txt
index b5102ed..958aef9 100644
--- a/docs/_sources/tutorials/language/schedule_primitives.rst.txt
+++ b/docs/_sources/tutorials/language/schedule_primitives.rst.txt
@@ -85,13 +85,13 @@ schedule computes tensor in a serial manner in a row-major order.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
-                 B: Buffer(B_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
+      buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
+                 C: Buffer(C_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
                  A: Buffer(A_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       for (i: int32, 0, m) {
         for (j: int32, 0, n) {
-          C_2[((i*stride) + (j*stride_1))] = ((float32*)A_2[((i*stride_4) + (j*stride_5))]*(float32*)B_2[((i*stride_2) + (j*stride_3))])
+          C_2[((i*stride_2) + (j*stride_3))] = ((float32*)A_2[((i*stride_4) + (j*stride_5))]*(float32*)B_2[((i*stride) + (j*stride_1))])
         }
       }
     }
@@ -449,13 +449,13 @@ of computation of `C`.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
-                 B: Buffer(B_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
+      buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
+                 C: Buffer(C_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
                  A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type="auto")}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       for (i: int32, 0, m) {
-        B_2[(i*stride_1)] = ((float32*)A_2[(i*stride_2)] + 1f32)
-        C_2[(i*stride)] = ((float32*)B_2[(i*stride_1)]*2f32)
+        B_2[(i*stride)] = ((float32*)A_2[(i*stride_2)] + 1f32)
+        C_2[(i*stride_1)] = ((float32*)B_2[(i*stride)]*2f32)
       }
     }
 
@@ -492,12 +492,12 @@ tensor is required.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
-                 C: Buffer(C_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
+      buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
+                 B: Buffer(B_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
                  A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type="auto")}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       for (i: int32, 0, m) {
-        C_2[(i*stride_1)] = (((float32*)A_2[(i*stride_2)] + 1f32)*2f32)
+        C_2[(i*stride)] = (((float32*)A_2[(i*stride_2)] + 1f32)*2f32)
       }
     }
 
diff --git a/docs/_sources/tutorials/language/sg_execution_times.rst.txt b/docs/_sources/tutorials/language/sg_execution_times.rst.txt
index 4bbb3f9..bda4706 100644
--- a/docs/_sources/tutorials/language/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/language/sg_execution_times.rst.txt
@@ -5,13 +5,13 @@
 
 Computation times
 =================
-**00:04.698** total execution time for **tutorials_language** files:
+**00:04.710** total execution time for **tutorials_language** files:
 
-- **00:01.809**: :ref:`sphx_glr_tutorials_language_intrin_math.py` (``intrin_math.py``)
-- **00:00.864**: :ref:`sphx_glr_tutorials_language_tensorize.py` (``tensorize.py``)
+- **00:01.825**: :ref:`sphx_glr_tutorials_language_intrin_math.py` (``intrin_math.py``)
+- **00:00.870**: :ref:`sphx_glr_tutorials_language_tensorize.py` (``tensorize.py``)
 - **00:00.624**: :ref:`sphx_glr_tutorials_language_scan.py` (``scan.py``)
 - **00:00.597**: :ref:`sphx_glr_tutorials_language_reduction.py` (``reduction.py``)
 - **00:00.252**: :ref:`sphx_glr_tutorials_language_extern_op.py` (``extern_op.py``)
-- **00:00.212**: :ref:`sphx_glr_tutorials_language_schedule_primitives.py` (``schedule_primitives.py``)
-- **00:00.186**: :ref:`sphx_glr_tutorials_language_tedd.py` (``tedd.py``)
-- **00:00.154**: :ref:`sphx_glr_tutorials_language_tuple_inputs.py` (``tuple_inputs.py``)
+- **00:00.211**: :ref:`sphx_glr_tutorials_language_schedule_primitives.py` (``schedule_primitives.py``)
+- **00:00.181**: :ref:`sphx_glr_tutorials_language_tedd.py` (``tedd.py``)
+- **00:00.150**: :ref:`sphx_glr_tutorials_language_tuple_inputs.py` (``tuple_inputs.py``)
diff --git a/docs/_sources/tutorials/language/tensorize.rst.txt b/docs/_sources/tutorials/language/tensorize.rst.txt
index bf04573..83d8daa 100644
--- a/docs/_sources/tutorials/language/tensorize.rst.txt
+++ b/docs/_sources/tutorials/language/tensorize.rst.txt
@@ -119,8 +119,8 @@ Thus we break down the matmul loops to make the innermost loops a (16x64) GEMV.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
-                 C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
+      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
+                 B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       for (i: int32, 0, 1024) {
@@ -312,8 +312,8 @@ The importing needs to happen before the tensorized GEMV being executed.
                  B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
-      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmpp_vwn3cy/input0.cc'
-    source_filename = "/tmp/tmpp_vwn3cy/input0.cc"
+      attr [IterVar(i: int32, (nullptr), "DataPar", "")] "pragma_import_llvm" = "; ModuleID = '/tmp/tmp4nmf4z66/input0.cc'
+    source_filename = "/tmp/tmp4nmf4z66/input0.cc"
     target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
     target triple = "x86_64-pc-linux-gnu"
 
diff --git a/docs/_sources/tutorials/language/tuple_inputs.rst.txt b/docs/_sources/tutorials/language/tuple_inputs.rst.txt
index 0e0c849..5ea0303 100644
--- a/docs/_sources/tutorials/language/tuple_inputs.rst.txt
+++ b/docs/_sources/tutorials/language/tuple_inputs.rst.txt
@@ -65,14 +65,14 @@ together in the next schedule procedure.
     primfn(A0_1: handle, A1_1: handle, B.v0_1: handle, B.v1_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
       buffers = {B.v1: Buffer(B.v1_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
-                 B.v0: Buffer(B.v0_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
-                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto"),
+                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
+                 B.v0: Buffer(B.v0_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto"),
                  A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_6: int32, stride_7: int32], type="auto")}
       buffer_map = {A0_1: A0, A1_1: A1, B.v0_1: B.v0, B.v1_1: B.v1} {
       for (i: int32, 0, m) {
         for (j: int32, 0, n) {
-          B.v0_2[((i*stride_2) + (j*stride_3))] = ((float32*)A0_2[((i*stride_6) + (j*stride_7))] + 2f32)
-          B.v1_2[((i*stride) + (j*stride_1))] = ((float32*)A1_2[((i*stride_4) + (j*stride_5))]*3f32)
+          B.v0_2[((i*stride_4) + (j*stride_5))] = ((float32*)A0_2[((i*stride_6) + (j*stride_7))] + 2f32)
+          B.v1_2[((i*stride) + (j*stride_1))] = ((float32*)A1_2[((i*stride_2) + (j*stride_3))]*3f32)
         }
       }
     }
@@ -136,16 +136,16 @@ with :py:func:`te.comm_reducer` as below:
     primfn(idx_1: handle, val_1: handle, T.v0_1: handle, T.v1_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
       buffers = {T.v0: Buffer(T.v0_2: Pointer(int32), int32, [m: int32], [stride: int32], type="auto"),
-                 val: Buffer(val_2: Pointer(int32), int32, [m, n: int32], [stride_1: int32, stride_2: int32], type="auto"),
-                 T.v1: Buffer(T.v1_2: Pointer(int32), int32, [m], [stride_3: int32], type="auto"),
+                 T.v1: Buffer(T.v1_2: Pointer(int32), int32, [m], [stride_1: int32], type="auto"),
+                 val: Buffer(val_2: Pointer(int32), int32, [m, n: int32], [stride_2: int32, stride_3: int32], type="auto"),
                  idx: Buffer(idx_2: Pointer(int32), int32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
       buffer_map = {idx_1: idx, val_1: val, T.v0_1: T.v0, T.v1_1: T.v1} {
       for (i: int32, 0, m) {
         T.v0_2[(i*stride)] = -1
-        T.v1_2[(i*stride_3)] = -2147483648
+        T.v1_2[(i*stride_1)] = -2147483648
         for (k: int32, 0, n) {
-          T.v0_2[(i*stride)] = @tir.if_then_else(((int32*)val_2[((i*stride_1) + (k*stride_2))] <= (int32*)T.v1_2[(i*stride_3)]), (int32*)T.v0_2[(i*stride)], (int32*)idx_2[((i*stride_4) + (k*stride_5))], dtype=int32)
-          T.v1_2[(i*stride_3)] = @tir.if_then_else(((int32*)val_2[((i*stride_1) + (k*stride_2))] <= (int32*)T.v1_2[(i*stride_3)]), (int32*)T.v1_2[(i*stride_3)], (int32*)val_2[((i*stride_1) + (k*stride_2))], dtype=int32)
+          T.v0_2[(i*stride)] = @tir.if_then_else(((int32*)val_2[((i*stride_2) + (k*stride_3))] <= (int32*)T.v1_2[(i*stride_1)]), (int32*)T.v0_2[(i*stride)], (int32*)idx_2[((i*stride_4) + (k*stride_5))], dtype=int32)
+          T.v1_2[(i*stride_1)] = @tir.if_then_else(((int32*)val_2[((i*stride_2) + (k*stride_3))] <= (int32*)T.v1_2[(i*stride_1)]), (int32*)T.v1_2[(i*stride_1)], (int32*)val_2[((i*stride_2) + (k*stride_3))], dtype=int32)
         }
       }
     }
@@ -193,8 +193,8 @@ in terms of operation.
 
     primfn(A0_1: handle, A1_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {A1: Buffer(A1_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
-                 C: Buffer(C_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
+      buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
+                 A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
                  A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
       buffer_map = {A0_1: A0, A1_1: A1, C_1: C} {
       attr [B.v0: Pointer(float32)] "storage_scope" = "global";
@@ -207,7 +207,7 @@ in terms of operation.
           B.v1[j] = ((float32*)A0_2[((i*stride_4) + (j*stride_5))]*3f32)
         }
         for (j_1: int32, 0, n) {
-          C_2[((i*stride_2) + (j_1*stride_3))] = ((float32*)A1_2[((i*stride) + (j_1*stride_1))] + (float32*)B.v0[j_1])
+          C_2[((i*stride) + (j_1*stride_1))] = ((float32*)A1_2[((i*stride_2) + (j_1*stride_3))] + (float32*)B.v0[j_1])
         }
       }
     }
diff --git a/docs/_sources/tutorials/micro/sg_execution_times.rst.txt b/docs/_sources/tutorials/micro/sg_execution_times.rst.txt
index 8c69056..54b442b 100644
--- a/docs/_sources/tutorials/micro/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/micro/sg_execution_times.rst.txt
@@ -5,6 +5,6 @@
 
 Computation times
 =================
-**00:09.958** total execution time for **tutorials_micro** files:
+**00:09.992** total execution time for **tutorials_micro** files:
 
-- **00:09.958**: :ref:`sphx_glr_tutorials_micro_micro_tflite.py` (``micro_tflite.py``)
+- **00:09.992**: :ref:`sphx_glr_tutorials_micro_micro_tflite.py` (``micro_tflite.py``)
diff --git a/docs/_sources/tutorials/optimize/opt_conv_cuda.rst.txt b/docs/_sources/tutorials/optimize/opt_conv_cuda.rst.txt
index 744512c..d4abccf 100644
--- a/docs/_sources/tutorials/optimize/opt_conv_cuda.rst.txt
+++ b/docs/_sources/tutorials/optimize/opt_conv_cuda.rst.txt
@@ -296,7 +296,7 @@ latency of convolution.
 
  .. code-block:: none
 
-    Convolution: 53.235596 ms
+    Convolution: 53.272718 ms
 
 
 
diff --git a/docs/_sources/tutorials/optimize/opt_conv_tensorcore.rst.txt b/docs/_sources/tutorials/optimize/opt_conv_tensorcore.rst.txt
index 00aafce..26f940c 100644
--- a/docs/_sources/tutorials/optimize/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/tutorials/optimize/opt_conv_tensorcore.rst.txt
@@ -405,8 +405,8 @@ one time.
 
     primfn(A_1: handle, W_1: handle, Conv_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
-                 Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
+      buffers = {Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
+                 W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
                  A: Buffer(A_2: Pointer(float16), float16, [16, 14, 14, 16, 16, 16], [])}
       buffer_map = {A_1: A, W_1: W, Conv_1: Conv} {
       attr [IterVar(blockIdx.z: int32, (nullptr), "ThreadIndex", "blockIdx.z")] "thread_extent" = 196;
@@ -523,8 +523,8 @@ by mapping the 2D convolution to tensor intrinsics
 
     primfn(A_1: handle, W_1: handle, Conv_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
-                 W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
+      buffers = {W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
+                 Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
                  A: Buffer(A_2: Pointer(float16), float16, [16, 14, 14, 16, 16, 16], [])}
       buffer_map = {A_1: A, W_1: W, Conv_1: Conv} {
       attr [IterVar(blockIdx.z: int32, (nullptr), "ThreadIndex", "blockIdx.z")] "thread_extent" = 196;
@@ -624,7 +624,7 @@ be able to run on our build server
 
  .. code-block:: none
 
-    conv2d with tensor core: 11.952753 ms
+    conv2d with tensor core: 13.384932 ms
 
 
 
diff --git a/docs/_sources/tutorials/optimize/opt_gemm.rst.txt b/docs/_sources/tutorials/optimize/opt_gemm.rst.txt
index 8ed61c3..58d76f3 100644
--- a/docs/_sources/tutorials/optimize/opt_gemm.rst.txt
+++ b/docs/_sources/tutorials/optimize/opt_gemm.rst.txt
@@ -118,8 +118,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 
  .. code-block:: none
 
-    Numpy running time: 0.008755
-    Baseline: 3.522159
+    Numpy running time: 0.012972
+    Baseline: 3.358347
 
 
 
@@ -206,7 +206,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 
  .. code-block:: none
 
-    Opt1: 0.290302
+    Opt1: 0.288556
 
 
 
@@ -300,7 +300,7 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cach
 
  .. code-block:: none
 
-    Opt2: 0.323975
+    Opt2: 0.323788
 
 
 
@@ -324,8 +324,8 @@ Here is the generated IR after vectorization.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
-                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+      buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+                 C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       for (x.outer: int32, 0, 32) {
@@ -389,7 +389,7 @@ the access pattern for A matrix is more cache friendly.
 
  .. code-block:: none
 
-    Opt3: 0.112113
+    Opt3: 0.112968
 
 
 
@@ -413,8 +413,8 @@ Here is the generated IR after loop permutation.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
-                 C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
+      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
+                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       for (x.outer: int32, 0, 32) {
@@ -499,7 +499,7 @@ the corresponding value from the packed array.
 
  .. code-block:: none
 
-    Opt4: 0.106036
+    Opt4: 0.106808
 
 
 
@@ -523,8 +523,8 @@ Here is the generated IR after array packing.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
-                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+      buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+                 C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       attr [packedB: Pointer(float32)] "storage_scope" = "global";
@@ -609,7 +609,7 @@ write to C when all the block results are ready.
 
  .. code-block:: none
 
-    Opt5: 0.097966
+    Opt5: 0.097606
 
 
 
@@ -725,7 +725,7 @@ Furthermore, we can also utilize multi-core processors to do the thread-level par
 
  .. code-block:: none
 
-    Opt6: 0.031980
+    Opt6: 0.031942
 
 
 
@@ -749,8 +749,8 @@ Here is the generated IR after parallelization.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
-                 C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
+      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
+                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       attr [packedB: Pointer(float32)] "storage_scope" = "global";
diff --git a/docs/_sources/tutorials/optimize/sg_execution_times.rst.txt b/docs/_sources/tutorials/optimize/sg_execution_times.rst.txt
index 23c7f1d..dd17436 100644
--- a/docs/_sources/tutorials/optimize/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/optimize/sg_execution_times.rst.txt
@@ -5,9 +5,9 @@
 
 Computation times
 =================
-**00:27.695** total execution time for **tutorials_optimize** files:
+**00:27.651** total execution time for **tutorials_optimize** files:
 
-- **00:25.261**: :ref:`sphx_glr_tutorials_optimize_opt_gemm.py` (``opt_gemm.py``)
-- **00:01.219**: :ref:`sphx_glr_tutorials_optimize_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``)
-- **00:01.022**: :ref:`sphx_glr_tutorials_optimize_opt_conv_cuda.py` (``opt_conv_cuda.py``)
-- **00:00.193**: :ref:`sphx_glr_tutorials_optimize_opt_matmul_auto_tensorcore.py` (``opt_matmul_auto_tensorcore.py``)
+- **00:25.201**: :ref:`sphx_glr_tutorials_optimize_opt_gemm.py` (``opt_gemm.py``)
+- **00:01.236**: :ref:`sphx_glr_tutorials_optimize_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``)
+- **00:01.023**: :ref:`sphx_glr_tutorials_optimize_opt_conv_cuda.py` (``opt_conv_cuda.py``)
+- **00:00.191**: :ref:`sphx_glr_tutorials_optimize_opt_matmul_auto_tensorcore.py` (``opt_matmul_auto_tensorcore.py``)
diff --git a/docs/_sources/tutorials/topi/intro_topi.rst.txt b/docs/_sources/tutorials/topi/intro_topi.rst.txt
index fd981d6..13ad371 100644
--- a/docs/_sources/tutorials/topi/intro_topi.rst.txt
+++ b/docs/_sources/tutorials/topi/intro_topi.rst.txt
@@ -230,7 +230,7 @@ As you can see, scheduled stages of computation have been accumulated and we can
 
  .. code-block:: none
 
-    [stage(a, 0xba459db0), stage(b, 0x72f1bc40), stage(T_add, 0xce45cd30), stage(T_multiply, 0x12cdc5b40), stage(T_elemwise_sum, 0x1448e7aa0), stage(T_divide, 0x7e043690), stage(T_divide_red.rf, 0xd1159650), stage(T_divide_red, 0xb5be5260)]
+    [stage(a, 0xcd6241b0), stage(b, 0x13297e0f0), stage(T_add, 0xbb8bc250), stage(T_multiply, 0x7bef6060), stage(T_elemwise_sum, 0xd12a7050), stage(T_divide, 0xb17f3730), stage(T_divide_red.rf, 0xbbbb71a0), stage(T_divide_red, 0xc4ee03a0)]
 
 
 
diff --git a/docs/_sources/tutorials/topi/sg_execution_times.rst.txt b/docs/_sources/tutorials/topi/sg_execution_times.rst.txt
index 6849b21..52be67d 100644
--- a/docs/_sources/tutorials/topi/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/topi/sg_execution_times.rst.txt
@@ -5,6 +5,6 @@
 
 Computation times
 =================
-**00:00.655** total execution time for **tutorials_topi** files:
+**00:00.661** total execution time for **tutorials_topi** files:
 
-- **00:00.655**: :ref:`sphx_glr_tutorials_topi_intro_topi.py` (``intro_topi.py``)
+- **00:00.661**: :ref:`sphx_glr_tutorials_topi_intro_topi.py` (``intro_topi.py``)
diff --git a/docs/_sources/vta/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/vta/tutorials/autotvm/sg_execution_times.rst.txt
index 131f348..5fa9a21 100644
--- a/docs/_sources/vta/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/vta/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,6 +5,6 @@
 
 Computation times
 =================
-**00:07.897** total execution time for **vta_tutorials_autotvm** files:
+**00:07.412** total execution time for **vta_tutorials_autotvm** files:
 
-- **00:07.897**: :ref:`sphx_glr_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``)
+- **00:07.412**: :ref:`sphx_glr_vta_tutorials_autotvm_tune_relay_vta.py` (``tune_relay_vta.py``)
diff --git a/docs/_sources/vta/tutorials/autotvm/tune_relay_vta.rst.txt b/docs/_sources/vta/tutorials/autotvm/tune_relay_vta.rst.txt
index 9a7fe7a..40951c2 100644
--- a/docs/_sources/vta/tutorials/autotvm/tune_relay_vta.rst.txt
+++ b/docs/_sources/vta/tutorials/autotvm/tune_relay_vta.rst.txt
@@ -497,7 +497,7 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
  .. code-block:: none
 
     Extract tasks...
-
    ...1%, 0.01 MB, 11 KB/s, 0 seconds passed
    ...2%, 0.02 MB, 23 KB/s, 0 seconds passed
    ...3%, 0.02 MB, 35 KB/s, 0 seconds passed
    ...4%, 0.03 MB, 47 KB/s, 0 seconds passed
    ...5%, 0.04 MB, 58 KB/s, 0 seconds passed
    ...6%, 0.05 MB, 70 KB/s, 0 seconds passed
    ...7%, 0.05 MB, 81 KB/s, 0 seconds passed
    ...8%, 0.06 MB, 93 KB/s, 0 seconds passed
    ...9%, 0.07 MB, 104 KB/s, 0 seconds passed
    ...10%, 0.08 MB, 116 KB/s, 0 seconds passed
    ...11%, 0.09 MB, 127 KB/s, 0 seconds passed
    ...13%, 0.09 MB, 138 KB/s, 0 seconds passed
    ...14%, 0.10 MB, 150 KB/s, 0 seconds passed
    ...15%, 0.11 MB, 161 KB/s, 0 seconds passed
    ...16%, 0.12 MB, 172 KB/s, 0 seconds passed
    ...17%, 0.12 MB, 184 KB/s, 0 seconds passed
    ...18%, 0.13 MB, 195 KB/s, 0 seconds passed
    ...19%, 0.14 MB, 207 KB/s, 0 seconds passed
    ...20%, 0.15 MB, 218 KB/s, 0 seconds passed
    ...21%, 0.16 MB, 229 KB/s, 0 seconds passed
    ...22%, 0.16 MB, 241 KB/s, 0 seconds passed
    ...23%, 0.17 MB, 252 KB/s, 0 seconds passed
    ...25%, 0.18 MB, 262 KB/s, 0 seconds passed
    ...26%, 0.19 MB, 274 KB/s, 0 seconds passed
    ...27%, 0.20 MB, 285 KB/s, 0 seconds passed
    ...28%, 0.20 MB, 296 KB/s, 0 seconds passed
    ...29%, 0.21 MB, 307 KB/s, 0 seconds passed
    ...30%, 0.22 MB, 319 KB/s, 0 seconds passed
    ...31%, 0.23 MB, 330 KB/s, 0 seconds passed
    ...32%, 0.23 MB, 341 KB/s, 0 seconds passed
    ...33%, 0.24 MB, 353 KB/s, 0 seconds passed
    ...34%, 0.25 MB, 363 KB/s, 0 seconds passed
    ...35%, 0.26 MB, 375 KB/s, 0 seconds passed
    ...36%, 0.27 MB, 386 KB/s, 0 seconds passed
    ...38%, 0.27 MB, 397 KB/s, 0 seconds passed
    ...39%, 0.28 MB, 408 KB/s, 0 seconds passed
    ...40%, 0.29 MB, 419 KB/s, 0 seconds passed
    ...41%, 0.30 MB, 431 KB/s, 0 seconds passed
    ...42%, 0.30 MB, 442 KB/s, 0 seconds passed
    ...43%, 0.31 MB, 453 KB/s, 0 seconds passed
    ...44%, 0.32 MB, 464 KB/s, 0 seconds passed
    ...45%, 0.33 MB, 476 KB/s, 0 seconds passed
    ...46%, 0.34 MB, 487 KB/s, 0 seconds passed
    ...47%, 0.34 MB, 496 KB/s, 0 seconds passed
    ...48%, 0.35 MB, 507 KB/s, 0 seconds passed
    ...50%, 0.36 MB, 518 KB/s, 0 seconds passed
    ...51%, 0.37 MB, 529 KB/s, 0 seconds passed
    ...52%, 0.38 MB, 540 KB/s, 0 seconds passed
    ...53%, 0.38 MB, 551 KB/s, 0 seconds passed
    ...54%, 0.39 MB, 563 KB/s, 0 seconds passed
    ...55%, 0.40 MB, 574 KB/s, 0 seconds passed
    ...56%, 0.41 MB, 584 KB/s, 0 seconds passed
    ...57%, 0.41 MB, 596 KB/s, 0 seconds passed
    ...58%, 0.42 MB, 606 KB/s, 0 seconds passed
    ...59%, 0.43 MB, 617 KB/s, 0 seconds passed
    ...60%, 0.44 MB, 628 KB/s, 0 seconds passed
    ...62%, 0.45 MB, 639 KB/s, 0 seconds passed
    ...63%, 0.45 MB, 649 KB/s, 0 seconds passed
    ...64%, 0.46 MB, 661 KB/s, 0 seconds passed
    ...65%, 0.47 MB, 672 KB/s, 0 seconds passed
    ...66%, 0.48 MB, 683 KB/s, 0 seconds passed
    ...67%, 0.48 MB, 693 KB/s, 0 seconds passed
    ...68%, 0.49 MB, 704 KB/s, 0 seconds passed
    ...69%, 0.50 MB, 714 KB/s, 0 seconds passed
    ...70%, 0.51 MB, 725 KB/s, 0 seconds passed
    ...71%, 0.52 MB, 735 KB/s, 0 seconds passed
    ...72%, 0.52 MB, 746 KB/s, 0 seconds passed
    ...73%, 0.53 MB, 757 KB/s, 0 seconds passed
    ...75%, 0.54 MB, 768 KB/s, 0 seconds passed
    ...76%, 0.55 MB, 779 KB/s, 0 seconds passed
    ...77%, 0.55 MB, 790 KB/s, 0 seconds passed
    ...78%, 0.56 MB, 800 KB/s, 0 seconds passed
    ...79%, 0.57 MB, 811 KB/s, 0 seconds passed
    ...80%, 0.58 MB, 821 KB/s, 0 seconds passed
    ...81%, 0.59 MB, 832 KB/s, 0 seconds passed
    ...82%, 0.59 MB, 843 KB/s, 0 seconds passed
    ...83%, 0.60 MB, 854 KB/s, 0 seconds passed
    ...84%, 0.61 MB, 865 KB/s, 0 seconds passed
    ...85%, 0.62 MB, 876 KB/s, 0 seconds passed
    ...87%, 0.62 MB, 886 KB/s, 0 seconds passed
    ...88%, 0.63 MB, 897 KB/s, 0 seconds passed
    ...89%, 0.64 MB, 907 KB/s, 0 seconds passed
    ...90%, 0.65 MB, 918 KB/s, 0 seconds passed
    ...91%, 0.66 MB, 929 KB/s, 0 seconds passed
    ...92%, 0.66 MB, 940 KB/s, 0 seconds passed
    ...93%, 0.67 MB, 951 KB/s, 0 seconds passed
    ...94%, 0.68 MB, 957 KB/s, 0 seconds passed
    ...95%, 0.69 MB, 968 KB/s, 0 seconds passed
    ...96%, 0.70 MB, 979 KB/s, 0 seconds passed
    ...97%, 0.70 MB, 990 KB/s, 0 seconds passed
    ...99%, 0.71 MB, 1000 KB/s, 0 seconds passed
    ...100%, 0.72 MB, 1011 KB/s, 0 seconds passed
+
    ...1%, 0.01 MB, 35 KB/s, 0 seconds passed
    ...2%, 0.02 MB, 68 KB/s, 0 seconds passed
    ...3%, 0.02 MB, 102 KB/s, 0 seconds passed
    ...4%, 0.03 MB, 135 KB/s, 0 seconds passed
    ...5%, 0.04 MB, 169 KB/s, 0 seconds passed
    ...6%, 0.05 MB, 199 KB/s, 0 seconds passed
    ...7%, 0.05 MB, 232 KB/s, 0 seconds passed
    ...8%, 0.06 MB, 264 KB/s, 0 seconds passed
    ...9%, 0.07 MB, 297 KB/s, 0 seconds passed
    ...10%, 0.08 MB, 329 KB/s, 0 seconds passed
    ...11%, 0.09 MB, 362 KB/s, 0 seconds passed
    ...13%, 0.09 MB, 389 KB/s, 0 seconds passed
    ...14%, 0.10 MB, 422 KB/s, 0 seconds passed
    ...15%, 0.11 MB, 452 KB/s, 0 seconds passed
    ...16%, 0.12 MB, 484 KB/s, 0 seconds passed
    ...17%, 0.12 MB, 515 KB/s, 0 seconds passed
    ...18%, 0.13 MB, 547 KB/s, 0 seconds passed
    ...19%, 0.14 MB, 579 KB/s, 0 seconds passed
    ...20%, 0.15 MB, 610 KB/s, 0 seconds passed
    ...21%, 0.16 MB, 642 KB/s, 0 seconds passed
    ...22%, 0.16 MB, 673 KB/s, 0 seconds passed
    ...23%, 0.17 MB, 704 KB/s, 0 seconds passed
    ...25%, 0.18 MB, 736 KB/s, 0 seconds passed
    ...26%, 0.19 MB, 768 KB/s, 0 seconds passed
    ...27%, 0.20 MB, 792 KB/s, 0 seconds passed
    ...28%, 0.20 MB, 820 KB/s, 0 seconds passed
    ...29%, 0.21 MB, 851 KB/s, 0 seconds passed
    ...30%, 0.22 MB, 883 KB/s, 0 seconds passed
    ...31%, 0.23 MB, 912 KB/s, 0 seconds passed
    ...32%, 0.23 MB, 943 KB/s, 0 seconds passed
    ...33%, 0.24 MB, 975 KB/s, 0 seconds passed
    ...34%, 0.25 MB, 1006 KB/s, 0 seconds passed
    ...35%, 0.26 MB, 1036 KB/s, 0 seconds passed
    ...36%, 0.27 MB, 1067 KB/s, 0 seconds passed
    ...38%, 0.27 MB, 1098 KB/s, 0 seconds passed
    ...39%, 0.28 MB, 1129 KB/s, 0 seconds passed
    ...40%, 0.29 MB, 1159 KB/s, 0 seconds passed
    ...41%, 0.30 MB, 1190 KB/s, 0 seconds passed
    ...42%, 0.30 MB, 1221 KB/s, 0 seconds passed
    ...43%, 0.31 MB, 1252 KB/s, 0 seconds passed
    ...44%, 0.32 MB, 1283 KB/s, 0 seconds passed
    ...45%, 0.33 MB, 1314 KB/s, 0 seconds passed
    ...46%, 0.34 MB, 1342 KB/s, 0 seconds passed
    ...47%, 0.34 MB, 1373 KB/s, 0 seconds passed
    ...48%, 0.35 MB, 1393 KB/s, 0 seconds passed
    ...50%, 0.36 MB, 1423 KB/s, 0 seconds passed
    ...51%, 0.37 MB, 1454 KB/s, 0 seconds passed
    ...52%, 0.38 MB, 1484 KB/s, 0 seconds passed
    ...53%, 0.38 MB, 1515 KB/s, 0 seconds passed
    ...54%, 0.39 MB, 1545 KB/s, 0 seconds passed
    ...55%, 0.40 MB, 1576 KB/s, 0 seconds passed
    ...56%, 0.41 MB, 1606 KB/s, 0 seconds passed
    ...57%, 0.41 MB, 1628 KB/s, 0 seconds passed
    ...58%, 0.42 MB, 1658 KB/s, 0 seconds passed
    ...59%, 0.43 MB, 1689 KB/s, 0 seconds passed
    ...60%, 0.44 MB, 1719 KB/s, 0 seconds passed
    ...62%, 0.45 MB, 1748 KB/s, 0 seconds passed
    ...63%, 0.45 MB, 1778 KB/s, 0 seconds passed
    ...64%, 0.46 MB, 1809 KB/s, 0 seconds passed
    ...65%, 0.47 MB, 1839 KB/s, 0 seconds passed
    ...66%, 0.48 MB, 1869 KB/s, 0 seconds passed
    ...67%, 0.48 MB, 1899 KB/s, 0 seconds passed
    ...68%, 0.49 MB, 1929 KB/s, 0 seconds passed
    ...69%, 0.50 MB, 1959 KB/s, 0 seconds passed
    ...70%, 0.51 MB, 1990 KB/s, 0 seconds passed
    ...71%, 0.52 MB, 2019 KB/s, 0 seconds passed
    ...72%, 0.52 MB, 2049 KB/s, 0 seconds passed
    ...73%, 0.53 MB, 2079 KB/s, 0 seconds passed
    ...75%, 0.54 MB, 2109 KB/s, 0 seconds passed
    ...76%, 0.55 MB, 2139 KB/s, 0 seconds passed
    ...77%, 0.55 MB, 2169 KB/s, 0 seconds passed
    ...78%, 0.56 MB, 2198 KB/s, 0 seconds passed
    ...79%, 0.57 MB, 2228 KB/s, 0 seconds passed
    ...80%, 0.58 MB, 2258 KB/s, 0 seconds passed
    ...81%, 0.59 MB, 2288 KB/s, 0 seconds passed
    ...82%, 0.59 MB, 2318 KB/s, 0 seconds passed
    ...83%, 0.60 MB, 2348 KB/s, 0 seconds passed
    ...84%, 0.61 MB, 2377 KB/s, 0 seconds passed
    ...85%, 0.62 MB, 2407 KB/s, 0 seconds passed
    ...87%, 0.62 MB, 2436 KB/s, 0 seconds passed
    ...88%, 0.63 MB, 2466 KB/s, 0 seconds passed
    ...89%, 0.64 MB, 2496 KB/s, 0 seconds passed
    ...90%, 0.65 MB, 2526 KB/s, 0 seconds passed
    ...91%, 0.66 MB, 2556 KB/s, 0 seconds passed
    ...92%, 0.66 MB, 2585 KB/s, 0 seconds passed
    ...93%, 0.67 MB, 2615 KB/s, 0 seconds passed
    ...94%, 0.68 MB, 2645 KB/s, 0 seconds passed
    ...95%, 0.69 MB, 2675 KB/s, 0 seconds passed
    ...96%, 0.70 MB, 2704 KB/s, 0 seconds passed
    ...97%, 0.70 MB, 2734 KB/s, 0 seconds passed
    ...99%, 0.71 MB, 2752 KB/s, 0 seconds passed
    ...100%, 0.72 MB, 2780 KB/s, 0 seconds passed
     Extracted 10 conv2d tasks:
     (1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)
     (1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)
diff --git a/docs/_sources/vta/tutorials/frontend/deploy_classification.rst.txt b/docs/_sources/vta/tutorials/frontend/deploy_classification.rst.txt
index 02cce99..0d3b9f1 100644
--- a/docs/_sources/vta/tutorials/frontend/deploy_classification.rst.txt
+++ b/docs/_sources/vta/tutorials/frontend/deploy_classification.rst.txt
@@ -243,8 +243,8 @@ The compilation steps are:
 
  .. code-block:: none
 
-
    ...12%, 0.01 MB, 41 KB/s, 0 seconds passed
    ...25%, 0.02 MB, 78 KB/s, 0 seconds passed
    ...38%, 0.02 MB, 118 KB/s, 0 seconds passed
    ...51%, 0.03 MB, 157 KB/s, 0 seconds passed
    ...64%, 0.04 MB, 194 KB/s, 0 seconds passed
    ...77%, 0.05 MB, 226 KB/s, 0 seconds passed
    ...89%, 0.05 MB, 263 KB/s, 0 seconds passed
    ...100%, 0.06 MB, 299 KB/s, 0 seconds passed
-    resnet18_v1 inference graph built in 8.14s!
+
    ...12%, 0.01 MB, 13 KB/s, 0 seconds passed
    ...25%, 0.02 MB, 25 KB/s, 0 seconds passed
    ...38%, 0.02 MB, 38 KB/s, 0 seconds passed
    ...51%, 0.03 MB, 51 KB/s, 0 seconds passed
    ...64%, 0.04 MB, 64 KB/s, 0 seconds passed
    ...77%, 0.05 MB, 76 KB/s, 0 seconds passed
    ...89%, 0.05 MB, 89 KB/s, 0 seconds passed
    ...100%, 0.06 MB, 101 KB/s, 0 seconds passed
+    resnet18_v1 inference graph built in 8.20s!
 
 
 
diff --git a/docs/_sources/vta/tutorials/frontend/sg_execution_times.rst.txt b/docs/_sources/vta/tutorials/frontend/sg_execution_times.rst.txt
index 9240241..a5928a5 100644
--- a/docs/_sources/vta/tutorials/frontend/sg_execution_times.rst.txt
+++ b/docs/_sources/vta/tutorials/frontend/sg_execution_times.rst.txt
@@ -5,6 +5,6 @@
 
 Computation times
 =================
-**00:29.475** total execution time for **vta_tutorials_frontend** files:
+**00:30.449** total execution time for **vta_tutorials_frontend** files:
 
-- **00:29.475**: :ref:`sphx_glr_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``)
+- **00:30.449**: :ref:`sphx_glr_vta_tutorials_frontend_deploy_classification.py` (``deploy_classification.py``)
diff --git a/docs/_sources/vta/tutorials/optimize/convolution_opt.rst.txt b/docs/_sources/vta/tutorials/optimize/convolution_opt.rst.txt
index 88370f8..e21cb9a 100644
--- a/docs/_sources/vta/tutorials/optimize/convolution_opt.rst.txt
+++ b/docs/_sources/vta/tutorials/optimize/convolution_opt.rst.txt
@@ -631,8 +631,8 @@ and mapping the shift, and clipping computation to the vector ALU.
 
     primfn(data_1: handle, kernel_1: handle, res_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {res: Buffer(res_2: Pointer(int8), int8, [1, 16, 14, 14, 1, 16], []),
-                 kernel: Buffer(kernel_2: Pointer(int8), int8, [16, 16, 3, 3, 16, 16], []),
+      buffers = {kernel: Buffer(kernel_2: Pointer(int8), int8, [16, 16, 3, 3, 16, 16], []),
+                 res: Buffer(res_2: Pointer(int8), int8, [1, 16, 14, 14, 1, 16], []),
                  data: Buffer(data_2: Pointer(int8), int8, [1, 16, 14, 14, 1, 16], [])}
       buffer_map = {data_1: data, kernel_1: kernel, res_1: res} {
       attr [res_conv: Pointer(int32)] "storage_scope" = "local.acc_buffer";
diff --git a/docs/_sources/vta/tutorials/optimize/matrix_multiply_opt.rst.txt b/docs/_sources/vta/tutorials/optimize/matrix_multiply_opt.rst.txt
index 34d0008..9f038d5 100644
--- a/docs/_sources/vta/tutorials/optimize/matrix_multiply_opt.rst.txt
+++ b/docs/_sources/vta/tutorials/optimize/matrix_multiply_opt.rst.txt
@@ -351,8 +351,8 @@ below:
 
     primfn(data_1: handle, weight_1: handle, res_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {weight: Buffer(weight_2: Pointer(int8), int8, [64, 64, 16, 16], []),
-                 res: Buffer(res_2: Pointer(int8), int8, [1, 64, 1, 16], []),
+      buffers = {res: Buffer(res_2: Pointer(int8), int8, [1, 64, 1, 16], []),
+                 weight: Buffer(weight_2: Pointer(int8), int8, [64, 64, 16, 16], []),
                  data: Buffer(data_2: Pointer(int8), int8, [1, 64, 1, 16], [])}
       buffer_map = {data_1: data, weight_1: weight, res_1: res} {
       attr [data_buf: Pointer(int8)] "storage_scope" = "global";
@@ -494,8 +494,8 @@ and mapping the shift, and clipping computation to the vector ALU.
 
     primfn(data_1: handle, weight_1: handle, res_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {weight: Buffer(weight_2: Pointer(int8), int8, [64, 64, 16, 16], []),
-                 res: Buffer(res_2: Pointer(int8), int8, [1, 64, 1, 16], []),
+      buffers = {res: Buffer(res_2: Pointer(int8), int8, [1, 64, 1, 16], []),
+                 weight: Buffer(weight_2: Pointer(int8), int8, [64, 64, 16, 16], []),
                  data: Buffer(data_2: Pointer(int8), int8, [1, 64, 1, 16], [])}
       buffer_map = {data_1: data, weight_1: weight, res_1: res} {
       attr [res_gem: Pointer(int32)] "storage_scope" = "local.acc_buffer";
diff --git a/docs/_sources/vta/tutorials/optimize/sg_execution_times.rst.txt b/docs/_sources/vta/tutorials/optimize/sg_execution_times.rst.txt
index 2466984..7c50bf2 100644
--- a/docs/_sources/vta/tutorials/optimize/sg_execution_times.rst.txt
+++ b/docs/_sources/vta/tutorials/optimize/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**00:03.810** total execution time for **vta_tutorials_optimize** files:
+**00:03.843** total execution time for **vta_tutorials_optimize** files:
 
-- **00:03.261**: :ref:`sphx_glr_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)
-- **00:00.550**: :ref:`sphx_glr_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``)
+- **00:03.283**: :ref:`sphx_glr_vta_tutorials_optimize_convolution_opt.py` (``convolution_opt.py``)
+- **00:00.560**: :ref:`sphx_glr_vta_tutorials_optimize_matrix_multiply_opt.py` (``matrix_multiply_opt.py``)
diff --git a/docs/_sources/vta/tutorials/sg_execution_times.rst.txt b/docs/_sources/vta/tutorials/sg_execution_times.rst.txt
index 6522db2..8508a4a 100644
--- a/docs/_sources/vta/tutorials/sg_execution_times.rst.txt
+++ b/docs/_sources/vta/tutorials/sg_execution_times.rst.txt
@@ -5,7 +5,7 @@
 
 Computation times
 =================
-**00:01.014** total execution time for **vta_tutorials** files:
+**00:01.011** total execution time for **vta_tutorials** files:
 
-- **00:00.513**: :ref:`sphx_glr_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``)
+- **00:00.511**: :ref:`sphx_glr_vta_tutorials_matrix_multiply.py` (``matrix_multiply.py``)
 - **00:00.501**: :ref:`sphx_glr_vta_tutorials_vta_get_started.py` (``vta_get_started.py``)
diff --git a/docs/_sources/vta/tutorials/vta_get_started.rst.txt b/docs/_sources/vta/tutorials/vta_get_started.rst.txt
index 8f6368b..3cfd63a 100644
--- a/docs/_sources/vta/tutorials/vta_get_started.rst.txt
+++ b/docs/_sources/vta/tutorials/vta_get_started.rst.txt
@@ -423,8 +423,8 @@ with an :code:`env.alu` pragma.
 
     primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {C: Buffer(C_2: Pointer(int8), int8, [1, 64, 1, 16], []),
-                 B: Buffer(B_2: Pointer(int32), int32, [1, 64, 1, 16], []),
+      buffers = {B: Buffer(B_2: Pointer(int32), int32, [1, 64, 1, 16], []),
+                 C: Buffer(C_2: Pointer(int8), int8, [1, 64, 1, 16], []),
                  A: Buffer(A_2: Pointer(int32), int32, [1, 64, 1, 16], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C} {
       attr [A_buf: Pointer(int32)] "storage_scope" = "local.acc_buffer" {
diff --git a/docs/api/python/auto_scheduler.html b/docs/api/python/auto_scheduler.html
index 88fe5c6..bcf9919 100644
--- a/docs/api/python/auto_scheduler.html
+++ b/docs/api/python/auto_scheduler.html
@@ -139,12 +139,7 @@
 <li class="toctree-l2"><a class="reference internal" href="relay/dataflow_pattern.html">tvm.relay.dataflow_pattern</a></li>
 <li class="toctree-l2"><a class="reference internal" href="relay/testing.html">tvm.relay.testing</a></li>
 <li class="toctree-l2"><a class="reference internal" href="autotvm.html">tvm.autotvm</a></li>
-<li class="toctree-l2 current"><a class="current reference internal" href="#">tvm.auto_scheduler</a><ul>
-<li class="toctree-l3"><a class="reference internal" href="#module-tvm.auto_scheduler.auto_schedule">tvm.auto_scheduler.auto_schedule</a></li>
-<li class="toctree-l3"><a class="reference internal" href="#tvm-auto-scheduler-workload-registry">tvm.auto_scheduler.workload_registry</a></li>
-<li class="toctree-l3"><a class="reference internal" href="#module-tvm.auto_scheduler.measure">tvm.auto_scheduler.measure</a></li>
-</ul>
-</li>
+<li class="toctree-l2 current"><a class="current reference internal" href="#">tvm.auto_scheduler</a></li>
 <li class="toctree-l2"><a class="reference internal" href="rpc.html">tvm.rpc</a></li>
 <li class="toctree-l2"><a class="reference internal" href="micro.html">tvm.micro</a></li>
 <li class="toctree-l2"><a class="reference internal" href="contrib.html">tvm.contrib</a></li>
@@ -235,35 +230,116 @@
   <div class="section" id="module-tvm.auto_scheduler">
 <span id="tvm-auto-scheduler"></span><h1>tvm.auto_scheduler<a class="headerlink" href="#module-tvm.auto_scheduler" title="Permalink to this headline">¶</a></h1>
 <p>Namespace for TVM Auto-scheduler.</p>
-<div class="section" id="module-tvm.auto_scheduler.auto_schedule">
-<span id="tvm-auto-scheduler-auto-schedule"></span><h2>tvm.auto_scheduler.auto_schedule<a class="headerlink" href="#module-tvm.auto_scheduler.auto_schedule" title="Permalink to this headline">¶</a></h2>
-<p>User interface for TVM Auto-scheduler.</p>
-<p>The basic schedule search process for TVM Auto-scheduler is designed to be:
-<cite>Program sampling</cite> -&gt; <cite>Performance Tuning</cite>.</p>
-<p>In <cite>Program sampling</cite>, we use some predefined precise or heuristic rules to generate several
-initial schedules. Based on these initial starting points, we perform <cite>Performance Tuning</cite> which
-uses cost model based evolutionary search to select schedules with the best performance.</p>
-<p>Candidate schedules are measured against the specific hardware target.</p>
+<p><strong>Classes</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.ComputeDAG" title="tvm.auto_scheduler.ComputeDAG"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ComputeDAG</span></code></a>(compute)</p></td>
+<td><p>The auto-scheduler’s computational graph and related program analyses.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.EmptyPolicy" title="tvm.auto_scheduler.EmptyPolicy"><code class="xref py py-obj docutils literal notranslate"><span class="pre">EmptyPolicy</span></code></a>(task[, init_search_callbacks])</p></td>
+<td><p>This is an example empty search policy which will always generate the init state of ComputeDAG.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.HardwareParams" title="tvm.auto_scheduler.HardwareParams"><code class="xref py py-obj docutils literal notranslate"><span class="pre">HardwareParams</span></code></a>(num_cores, vector_unit_bytes, …)</p></td>
+<td><p>The parameters of target hardware used to guide the search policy</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.LocalBuilder" title="tvm.auto_scheduler.LocalBuilder"><code class="xref py py-obj docutils literal notranslate"><span class="pre">LocalBuilder</span></code></a>([timeout, n_parallel, build_func])</p></td>
+<td><p>LocalBuilder uses local CPU cores to build programs in parallel.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.LocalRPCMeasureContext" title="tvm.auto_scheduler.LocalRPCMeasureContext"><code class="xref py py-obj docutils literal notranslate"><span class="pre">LocalRPCMeasureContext</span></code></a>([priority, …])</p></td>
+<td><p>A context wrapper for running RPCRunner locally.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.LocalRunner" title="tvm.auto_scheduler.LocalRunner"><code class="xref py py-obj docutils literal notranslate"><span class="pre">LocalRunner</span></code></a>([timeout, number, repeat, …])</p></td>
+<td><p>LocalRunner that uses the local CPU/GPU to measure the time cost of programs.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.MeasureInput" title="tvm.auto_scheduler.MeasureInput"><code class="xref py py-obj docutils literal notranslate"><span class="pre">MeasureInput</span></code></a>(task, state)</p></td>
+<td><p>Store the input of a measurement.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.MeasureResult" title="tvm.auto_scheduler.MeasureResult"><code class="xref py py-obj docutils literal notranslate"><span class="pre">MeasureResult</span></code></a>(costs, error_no, error_msg, …)</p></td>
+<td><p>Store the results of a measurement.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.PreloadMeasuredStates" title="tvm.auto_scheduler.PreloadMeasuredStates"><code class="xref py py-obj docutils literal notranslate"><span class="pre">PreloadMeasuredStates</span></code></a>([filename])</p></td>
+<td><p>A SearchCallback to load measured states from the log file for a search policy.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RPCRunner" title="tvm.auto_scheduler.RPCRunner"><code class="xref py py-obj docutils literal notranslate"><span class="pre">RPCRunner</span></code></a>(key, host, port[, priority, …])</p></td>
+<td><p>RPCRunner that uses RPC calls to measure the time cost of programs on remote devices.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RandomModel" title="tvm.auto_scheduler.RandomModel"><code class="xref py py-obj docutils literal notranslate"><span class="pre">RandomModel</span></code></a>()</p></td>
+<td><p>A model that returns random estimations for all inputs.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RecordReader" title="tvm.auto_scheduler.RecordReader"><code class="xref py py-obj docutils literal notranslate"><span class="pre">RecordReader</span></code></a>([filename])</p></td>
+<td><p>Reader of the json log file.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RecordToFile" title="tvm.auto_scheduler.RecordToFile"><code class="xref py py-obj docutils literal notranslate"><span class="pre">RecordToFile</span></code></a>([filename])</p></td>
+<td><p>A measurement callback that writes measurement records into a file.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><code class="xref py py-obj docutils literal notranslate"><span class="pre">SearchTask</span></code></a>(dag, workload_key, target[, …])</p></td>
+<td><p>The computation information and hardware parameters for a schedule search task.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.SketchPolicy" title="tvm.auto_scheduler.SketchPolicy"><code class="xref py py-obj docutils literal notranslate"><span class="pre">SketchPolicy</span></code></a>(task[, program_cost_model, …])</p></td>
+<td><p>The search policy that searches in a hierarchical search space defined by sketches.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.TuningOptions" title="tvm.auto_scheduler.TuningOptions"><code class="xref py py-obj docutils literal notranslate"><span class="pre">TuningOptions</span></code></a>([num_measure_trials, …])</p></td>
+<td><p>This controls the options of performance tuning.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel" title="tvm.auto_scheduler.XGBModel"><code class="xref py py-obj docutils literal notranslate"><span class="pre">XGBModel</span></code></a>([verbose_eval, num_warmup_sample, seed])</p></td>
+<td><p>Train an XGBoost model to predict the normalized throughputs of programs.</p></td>
+</tr>
+</tbody>
+</table>
+<p><strong>Functions</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.auto_schedule" title="tvm.auto_scheduler.auto_schedule"><code class="xref py py-obj docutils literal notranslate"><span class="pre">auto_schedule</span></code></a>(task[, search_policy, …])</p></td>
+<td><p>Run auto scheduling search for a task</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.create_task" title="tvm.auto_scheduler.create_task"><code class="xref py py-obj docutils literal notranslate"><span class="pre">create_task</span></code></a>(func, args, target[, …])</p></td>
+<td><p>Create a search task</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.load_best" title="tvm.auto_scheduler.load_best"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_best</span></code></a>(filename[, workload_key, target])</p></td>
+<td><p>Return the best measurement pair from a log file.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.load_records" title="tvm.auto_scheduler.load_records"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load_records</span></code></a>(filename)</p></td>
+<td><p>Load measurement records from a file.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.make_workload_key" title="tvm.auto_scheduler.make_workload_key"><code class="xref py py-obj docutils literal notranslate"><span class="pre">make_workload_key</span></code></a>(func, args)</p></td>
+<td><p>Make a workload key by function and arguments.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.register_workload" title="tvm.auto_scheduler.register_workload"><code class="xref py py-obj docutils literal notranslate"><span class="pre">register_workload</span></code></a>(func_name[, f, override])</p></td>
+<td><p>Register a function that generates a certain workload.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.save_records" title="tvm.auto_scheduler.save_records"><code class="xref py py-obj docutils literal notranslate"><span class="pre">save_records</span></code></a>(filename, inputs, results)</p></td>
+<td><p>Append measure records to file.</p></td>
+</tr>
+</tbody>
+</table>
 <dl class="py class">
-<dt id="tvm.auto_scheduler.auto_schedule.SearchTask">
-<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.auto_schedule.</code><code class="sig-name descname">SearchTask</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">dag</span></em>, <em class="sig-param"><span class="n">workload_key</span></em>, <em class="sig-param"><span class="n">target</span></em>, <em class="sig-param"><span class="n">target_host</span><span class="o">=</span><span class="default_value">None</span></e [...]
+<dt id="tvm.auto_scheduler.SearchTask">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">SearchTask</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">dag</span></em>, <em class="sig-param"><span class="n">workload_key</span></em>, <em class="sig-param"><span class="n">target</span></em>, <em class="sig-param"><span class="n">target_host</span><span class="o">=</span><span class="default_value">None</span></em>, <em class= [...]
 <dd><p>The computation information and hardware parameters for a schedule search task.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>dag</strong> (<em>ComputeDAG</em>) – The ComputeDAG for the corresponding compute declaration.</p></li>
+<li><p><strong>dag</strong> (<a class="reference internal" href="#tvm.auto_scheduler.ComputeDAG" title="tvm.auto_scheduler.ComputeDAG"><em>ComputeDAG</em></a>) – The ComputeDAG for the corresponding compute declaration.</p></li>
 <li><p><strong>workload_key</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The workload key for the corresponding compute declaration.</p></li>
 <li><p><strong>target</strong> (<a class="reference internal" href="target.html#tvm.target.Target" title="tvm.target.Target"><em>tvm.target.Target</em></a>) – The target device of this search task.</p></li>
 <li><p><strong>target_host</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="target.html#tvm.target.Target" title="tvm.target.Target"><em>tvm.target.Target</em></a><em>]</em>) – The target host device of this search task.</p></li>
-<li><p><strong>hardware_params</strong> (<em>Optional</em><em>[</em><em>HardwareParams</em><em>]</em>) – Hardware parameters used in this search task.</p></li>
+<li><p><strong>hardware_params</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="#tvm.auto_scheduler.HardwareParams" title="tvm.auto_scheduler.HardwareParams"><em>HardwareParams</em></a><em>]</em>) – Hardware parameters used in this search task.</p></li>
 </ul>
 </dd>
 </dl>
 </dd></dl>
 
 <dl class="py class">
-<dt id="tvm.auto_scheduler.auto_schedule.TuningOptions">
-<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.auto_schedule.</code><code class="sig-name descname">TuningOptions</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">num_measure_trials</span><span class="o">=</span><span class="default_value">0</span></em>, <em class="sig-param"><span class="n">early_stopping</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n" [...]
+<dt id="tvm.auto_scheduler.TuningOptions">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">TuningOptions</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">num_measure_trials</span><span class="o">=</span><span class="default_value">0</span></em>, <em class="sig-param"><span class="n">early_stopping</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">num_measures_ [...]
 <dd><p>This controls the options of performance tuning.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
@@ -288,9 +364,26 @@ Candidates:
 </dl>
 </dd></dl>
 
+<dl class="py class">
+<dt id="tvm.auto_scheduler.HardwareParams">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">HardwareParams</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">num_cores</span></em>, <em class="sig-param"><span class="n">vector_unit_bytes</span></em>, <em class="sig-param"><span class="n">cache_line_bytes</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.HardwareParams" title="Permalink  [...]
+<dd><p>The parameters of target hardware used to guide the search policy</p>
+<p>TODO(jcf94): This is considered to be merged with the new Target specification:
+<a class="reference external" href="https://discuss.tvm.ai/t/rfc-tvm-target-specification/6844">https://discuss.tvm.ai/t/rfc-tvm-target-specification/6844</a></p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>num_cores</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The number of device cores.</p></li>
+<li><p><strong>vector_unit_bytes</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The width of vector units in bytes.</p></li>
+<li><p><strong>cache_line_bytes</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The size of cache line in bytes.</p></li>
+</ul>
+</dd>
+</dl>
+</dd></dl>
+
 <dl class="py function">
-<dt id="tvm.auto_scheduler.auto_schedule.create_task">
-<code class="sig-prename descclassname">tvm.auto_scheduler.auto_schedule.</code><code class="sig-name descname">create_task</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">func</span></em>, <em class="sig-param"><span class="n">args</span></em>, <em class="sig-param"><span class="n">target</span></em>, <em class="sig-param"><span class="n">target_host</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class= [...]
+<dt id="tvm.auto_scheduler.create_task">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">create_task</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">func</span></em>, <em class="sig-param"><span class="n">args</span></em>, <em class="sig-param"><span class="n">target</span></em>, <em class="sig-param"><span class="n">target_host</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">hardware_p [...]
 <dd><p>Create a search task</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
@@ -300,7 +393,7 @@ Can be the a function or the function name.</p></li>
 <li><p><strong>args</strong> (<em>Union</em><em>[</em><a class="reference internal" href="relay/index.html#tvm.relay.Tuple" title="tvm.relay.Tuple"><em>Tuple</em></a><em>[</em><a class="reference internal" href="tir.html#tvm.tir.Any" title="tvm.tir.Any"><em>Any</em></a><em>, </em><em>..</em><em>]</em><em>, </em><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><a class="refer [...]
 <li><p><strong>target</strong> (<a class="reference internal" href="target.html#tvm.target.Target" title="tvm.target.Target"><em>tvm.target.Target</em></a>) – The target device of this search task.</p></li>
 <li><p><strong>target_host</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="target.html#tvm.target.Target" title="tvm.target.Target"><em>tvm.target.Target</em></a><em>]</em>) – The target host device of this search task.</p></li>
-<li><p><strong>hardware_params</strong> (<em>Optional</em><em>[</em><em>HardwareParams</em><em>]</em>) – Hardware parameters used in this search task.</p></li>
+<li><p><strong>hardware_params</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="#tvm.auto_scheduler.HardwareParams" title="tvm.auto_scheduler.HardwareParams"><em>HardwareParams</em></a><em>]</em>) – Hardware parameters used in this search task.</p></li>
 </ul>
 </dd>
 <dt class="field-even">Returns</dt>
@@ -313,15 +406,89 @@ Can be the a function or the function name.</p></li>
 </dd></dl>
 
 <dl class="py function">
-<dt id="tvm.auto_scheduler.auto_schedule.auto_schedule">
-<code class="sig-prename descclassname">tvm.auto_scheduler.auto_schedule.</code><code class="sig-name descname">auto_schedule</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">search_policy</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">tuning_options</span><span class="o">=</span><span class="default_value">auto_scheduler.TuningOptions(1786 [...]
+<dt id="tvm.auto_scheduler.auto_schedule">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">auto_schedule</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">search_policy</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">tuning_options</span><span class="o">=</span><span class="default_value">auto_scheduler.TuningOptions(43227072)</span></ [...]
 <dd><p>Run auto scheduling search for a task</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.auto_schedule.SearchTask" title="tvm.auto_scheduler.auto_schedule.SearchTask"><em>SearchTask</em></a>) – The SearchTask for the computation declaration.</p></li>
+<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The SearchTask for the computation declaration.</p></li>
 <li><p><strong>search_policy</strong> (<em>Optional</em><em>[</em><em>SearchPolicy</em><em>]</em>) – The search policy to be used for schedule search.</p></li>
-<li><p><strong>tuning_options</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="#tvm.auto_scheduler.auto_schedule.TuningOptions" title="tvm.auto_scheduler.auto_schedule.TuningOptions"><em>TuningOptions</em></a><em>]</em>) – Tuning and measurement options.</p></li>
+<li><p><strong>tuning_options</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="#tvm.auto_scheduler.TuningOptions" title="tvm.auto_scheduler.TuningOptions"><em>TuningOptions</em></a><em>]</em>) – Tuning and measurement options.</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p></p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p>A <cite>te.Schedule</cite> and a list of <cite>te.Tensor</cite> to be used in <cite>tvm.lower</cite> or <cite>tvm.build</cite>.</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.ComputeDAG">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">ComputeDAG</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">compute</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.ComputeDAG" title="Permalink to this definition">¶</a></dt>
+<dd><p>The auto-scheduler’s computational graph and related program analyses.</p>
+<p>We convert a compute declaration described by <cite>tvm.compute</cite> (could be a single operator or a
+subgraph) to a ComputeDAG. It keeps the input/output tensors, all operations in the DAG, and
+some static analysis results for the DAG (e.g. the total float operation count,
+consumer/producer relations of operations, whether an operation stage should
+be tiled/compute inlined).
+These analyses can help the search policy to make decisions during the search.
+ComputeDAG is also responsible for the interaction between auto-scheduler’s <cite>LoopState</cite> and
+TVM schedule (e.g. applying the <cite>LoopState</cite> transform steps to a TVM schedule, providing
+<cite>LoopState</cite> with extra information obtained from the TVM schedule).</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>compute</strong> (<em>Union</em><em>[</em><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><a class="reference internal" href="te.html#tvm.te.Tensor" title="tvm.te.Tensor"><em>Tensor</em></a><em>]</em><em>, </em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><e [...]
+</dd>
+</dl>
+<p><strong>Methods</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.ComputeDAG.apply_steps_from_state" title="tvm.auto_scheduler.ComputeDAG.apply_steps_from_state"><code class="xref py py-obj docutils literal notranslate"><span class="pre">apply_steps_from_state</span></code></a>(state[, layout_rewrite])</p></td>
+<td><p>Apply the history transform steps from a State to get a TVM schedule.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.ComputeDAG.get_init_state" title="tvm.auto_scheduler.ComputeDAG.get_init_state"><code class="xref py py-obj docutils literal notranslate"><span class="pre">get_init_state</span></code></a>()</p></td>
+<td><p>Get the init state of this ComputeDAG.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.ComputeDAG.infer_bound_from_state" title="tvm.auto_scheduler.ComputeDAG.infer_bound_from_state"><code class="xref py py-obj docutils literal notranslate"><span class="pre">infer_bound_from_state</span></code></a>(state)</p></td>
+<td><p>Infer and fill the bound of all iterators of a state.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.ComputeDAG.print_python_code_from_state" title="tvm.auto_scheduler.ComputeDAG.print_python_code_from_state"><code class="xref py py-obj docutils literal notranslate"><span class="pre">print_python_code_from_state</span></code></a>(state)</p></td>
+<td><p>Print transform steps in the history of a State as TVM’s python schedule code.</p></td>
+</tr>
+</tbody>
+</table>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.ComputeDAG.get_init_state">
+<code class="sig-name descname">get_init_state</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.ComputeDAG.get_init_state" title="Permalink to this definition">¶</a></dt>
+<dd><p>Get the init state of this ComputeDAG.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Returns</dt>
+<dd class="field-odd"><p><strong>state</strong> – The initial State without any transform steps.</p>
+</dd>
+<dt class="field-even">Return type</dt>
+<dd class="field-even"><p>State</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.ComputeDAG.apply_steps_from_state">
+<code class="sig-name descname">apply_steps_from_state</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">state</span></em>, <em class="sig-param"><span class="n">layout_rewrite</span><span class="o">=</span><span class="default_value">False</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.ComputeDAG.apply_steps_from_state" title="Permalink to this definition">¶</a></dt>
+<dd><p>Apply the history transform steps from a State to get a TVM schedule.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>state</strong> (<em>Union</em><em>[</em><em>State</em><em>, </em><em>StateObject</em><em>]</em>) – The state from which we get transform steps.</p></li>
+<li><p><strong>layout_rewrite</strong> (<em>Bool</em>) – Rewrite the layout of placeholders specified by “layout_free_placeholders” attr
+to make it most friendly for the generated schedule to read from.</p></li>
 </ul>
 </dd>
 <dt class="field-even">Returns</dt>
@@ -333,60 +500,302 @@ Can be the a function or the function name.</p></li>
 </dl>
 </dd></dl>
 
-</div>
-<div class="section" id="tvm-auto-scheduler-workload-registry">
-<h2>tvm.auto_scheduler.workload_registry<a class="headerlink" href="#tvm-auto-scheduler-workload-registry" title="Permalink to this headline">¶</a></h2>
-<dl class="py function">
-<dt id="tvm.auto_scheduler.workload_registry.register_workload">
-<code class="sig-prename descclassname">tvm.auto_scheduler.workload_registry.</code><code class="sig-name descname">register_workload</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">func_name</span></em>, <em class="sig-param"><span class="n">f</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">override</span><span class="o">=</span><span class="default_value">False</span></em><span class="sig-pare [...]
-<dd><p>Register a function that generates a certain workload.</p>
-<p>The input function should take hashable and jsonable arguments
-(int, float, tuple of int, tvm.tensor.Tensor, …) and return a list of tvm.tensor.Tensor.</p>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.ComputeDAG.print_python_code_from_state">
+<code class="sig-name descname">print_python_code_from_state</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">state</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.ComputeDAG.print_python_code_from_state" title="Permalink to this definition">¶</a></dt>
+<dd><p>Print transform steps in the history of a State as TVM’s Python schedule code.</p>
+<p>This is used to print transformation steps for debugging.
+Use <cite>apply_steps_from_state</cite> if you want to get a schedule for code generation.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>state</strong> (<em>Union</em><em>[</em><em>State</em><em>, </em><em>StateObject</em><em>]</em>) – The state from which we get transform steps.</p>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>str</strong> – The Python schedule code.</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p>str</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.ComputeDAG.infer_bound_from_state">
+<code class="sig-name descname">infer_bound_from_state</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">state</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.ComputeDAG.infer_bound_from_state" title="Permalink to this definition">¶</a></dt>
+<dd><p>Infer and fill the bound of all iterators of a state.</p>
+<p>A state may lose complete bound information after some transform steps
+(e.g., compute_at).
+We can call this function to infer and fill in all the bound information.
+This function calls the TVM InferBound pass internally to get the bounds.
+The state returned by this function is guaranteed to have complete iterator extent
+information.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>state</strong> (<em>Union</em><em>[</em><em>State</em><em>, </em><em>StateObject</em><em>]</em>) – The state from which we get transform steps.</p>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>updated_state</strong> – The State with complete bound information.</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p>State</p>
+</dd>
+</dl>
+</dd></dl>
+
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.RandomModel">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">RandomModel</code><a class="headerlink" href="#tvm.auto_scheduler.RandomModel" title="Permalink to this definition">¶</a></dt>
+<dd><p>A model that returns random estimates for all inputs.</p>
+<p><strong>Methods</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RandomModel.predict" title="tvm.auto_scheduler.RandomModel.predict"><code class="xref py py-obj docutils literal notranslate"><span class="pre">predict</span></code></a>(search_task, states)</p></td>
+<td><p>Predict the scores of states</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RandomModel.update" title="tvm.auto_scheduler.RandomModel.update"><code class="xref py py-obj docutils literal notranslate"><span class="pre">update</span></code></a>(inputs, results)</p></td>
+<td><p>Update the cost model according to new measurement results (training data).</p></td>
+</tr>
+</tbody>
+</table>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.RandomModel.update">
+<code class="sig-name descname">update</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">inputs</span></em>, <em class="sig-param"><span class="n">results</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.RandomModel.update" title="Permalink to this definition">¶</a></dt>
+<dd><p>Update the cost model according to new measurement results (training data).</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>func_name</strong> (<em>Union</em><em>[</em><a class="reference internal" href="relay/index.html#tvm.relay.Function" title="tvm.relay.Function"><em>Function</em></a><em>, </em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em>]</em>) – The generation function that returns the compute declaration Tensors or its function name.</p></li>
-<li><p><strong>f</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="relay/index.html#tvm.relay.Function" title="tvm.relay.Function"><em>Function</em></a><em>]</em>) – The generation function to be registered.</p></li>
-<li><p><strong>override</strong> (<em>boolean = False</em>) – Whether override existing entry.</p></li>
+<li><p><strong>inputs</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>auto_scheduler.measure.MeasureInput</em><em>]</em>) – The measurement inputs</p></li>
+<li><p><strong>results</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>auto_scheduler.measure.MeasureResult</em><em>]</em>) – The measurement results</p></li>
 </ul>
 </dd>
 </dl>
-<p class="rubric">Examples</p>
-<div class="highlight-python notranslate"><div class="highlight"><pre><span class="nd">@auto_scheduler.register_workload</span>
-<span class="k">def</span> <span class="nf">matmul</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">M</span><span class="p">,</span> <span class="n">K</span><span class="p">):</span>
-    <span class="n">A</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">placeholder</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="n">K</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;A&#39;</span><span class="p">)</span>
-    <span class="n">B</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">placeholder</span><span class="p">((</span><span class="n">K</span><span class="p">,</span> <span class="n">M</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;B&#39;</span><span class="p">)</span>
-    <span class="n">k</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">reduce_axis</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="n">K</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;k&#39;</span><span class="p">)</span>
-    <span class="n">C</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">compute</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="n">M</span><span class="p">),</span> <span class="k">lambda</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">:</span> <span class="n">tvm</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span c [...]
-    <span class="k">return</span> <span class="p">[</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">]</span>
-</pre></div>
-</div>
 </dd></dl>
 
-</div>
-<div class="section" id="module-tvm.auto_scheduler.measure">
-<span id="tvm-auto-scheduler-measure"></span><h2>tvm.auto_scheduler.measure<a class="headerlink" href="#module-tvm.auto_scheduler.measure" title="Permalink to this headline">¶</a></h2>
-<p>Distributed measurement infrastructure to measure the runtime costs of tensor programs.</p>
-<p>These functions are responsible for building the tvm module, uploading it to
-remote devices, recording the running time costs, and checking the correctness of the output.</p>
-<p>We separate the measurement into two steps: build and run.
-A builder builds the executable binary files and a runner runs the binary files to
-get the measurement results. The flow of data structures is</p>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.RandomModel.predict">
+<code class="sig-name descname">predict</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">search_task</span></em>, <em class="sig-param"><span class="n">states</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.RandomModel.predict" title="Permalink to this definition">¶</a></dt>
+<dd><p>Predict the scores of states</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>search_task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The search task of states</p></li>
+<li><p><strong>states</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>State</em><em>]</em>) – The input states</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>scores</strong> – The predicted scores for all states</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[<a class="reference external" href="https://docs.python.org/3/library/functions.html#float" title="(in Python v3.8)">float</a>]</p>
+</dd>
+</dl>
+</dd></dl>
+
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.XGBModel">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">XGBModel</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">verbose_eval</span><span class="o">=</span><span class="default_value">25</span></em>, <em class="sig-param"><span class="n">num_warmup_sample</span><span class="o">=</span><span class="default_value">100</span></em>, <em class="sig-param"><span class="n">seed</span><span clas [...]
+<dd><p>Train an XGBoost model to predict the normalized throughputs of programs.
+Let the normalized throughput be the score of a program (higher is better). We predict
+the (approximate) score of a program as the sum of the scores of all stages in this program,
+i.e., score(P) = score_s0 + score_s1 + … + score_sn,
+where score_si is the score of Stage i in Program P.
+We extract features for each stage and let XGBoost predict the score for each stage.
+We then sum up the predictions as the score of the whole program.
+We use RMSE as the loss function, i.e., loss(P, y) = 1/2 * (score(P) - y)^2,
+where P is the program and y is the normalized throughput according to
+the ground truth (measurement).
+XGBoost does not support this loss function because <cite>score(P)</cite> is a sum of the predictions
+of several samples, so we implemented a custom loss function and call it pack-sum-rmse.
+It is called “pack-sum” because we combine several samples into a “pack” and sum up
+their predictions.</p>
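The pack-sum loss described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the TVM or XGBoost implementation: the function name `pack_sum_loss` and the `pack_ids` argument (which maps each stage sample to its program) are hypothetical.

```python
def pack_sum_loss(stage_scores, pack_ids, targets):
    """Illustrative pack-sum squared-error loss (hypothetical helper, not TVM API).

    stage_scores: predicted score of every stage sample (flat list of floats)
    pack_ids:     pack_ids[k] is the index of the program that sample k belongs to
    targets:      measured normalized throughput y of every program
    Returns (loss per program, gradient per stage sample).
    """
    n_packs = len(targets)

    # score(P) = sum of the predicted scores of all stages in program P
    program_scores = [0.0] * n_packs
    for score, pid in zip(stage_scores, pack_ids):
        program_scores[pid] += score

    # loss(P, y) = 1/2 * (score(P) - y)^2
    losses = [0.5 * (s - y) ** 2 for s, y in zip(program_scores, targets)]

    # d loss / d stage_score = score(P) - y, shared by every stage in pack P
    grads = [program_scores[pid] - targets[pid] for pid in pack_ids]
    return losses, grads


# Two programs: the first has two stages, the second has one.
losses, grads = pack_sum_loss([0.25, 0.25, 0.5], [0, 0, 1], [1.0, 0.5])
```

Every stage in a pack receives the same gradient because the loss depends only on the pack's summed prediction. This is why a per-sample objective is not enough here.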
+<p><strong>Methods</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel.load" title="tvm.auto_scheduler.XGBModel.load"><code class="xref py py-obj docutils literal notranslate"><span class="pre">load</span></code></a>(file_name)</p></td>
+<td><p>Load the model from a file</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel.predict" title="tvm.auto_scheduler.XGBModel.predict"><code class="xref py py-obj docutils literal notranslate"><span class="pre">predict</span></code></a>(task, states)</p></td>
+<td><p>Predict the scores of states</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel.predict_stages" title="tvm.auto_scheduler.XGBModel.predict_stages"><code class="xref py py-obj docutils literal notranslate"><span class="pre">predict_stages</span></code></a>(task, states)</p></td>
+<td><p>Predict the scores of all stages in states.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel.save" title="tvm.auto_scheduler.XGBModel.save"><code class="xref py py-obj docutils literal notranslate"><span class="pre">save</span></code></a>(file_name)</p></td>
+<td><p>Save the model to a file</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel.update" title="tvm.auto_scheduler.XGBModel.update"><code class="xref py py-obj docutils literal notranslate"><span class="pre">update</span></code></a>(inputs, results)</p></td>
+<td><p>Update the cost model according to new measurement results (training data).</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.XGBModel.update_from_file" title="tvm.auto_scheduler.XGBModel.update_from_file"><code class="xref py py-obj docutils literal notranslate"><span class="pre">update_from_file</span></code></a>(file_name[, n_lines])</p></td>
+<td><p>Load measure records from a log file to update the cost model.</p></td>
+</tr>
+</tbody>
+</table>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.XGBModel.update">
+<code class="sig-name descname">update</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">inputs</span></em>, <em class="sig-param"><span class="n">results</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.XGBModel.update" title="Permalink to this definition">¶</a></dt>
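+<dd><p>Update the cost model according to new measurement results (training data).
+XGBoost does not support incremental training, so we re-train a new model every time.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>inputs</strong> (<em>List</em><em>[</em><em>MeasureInput</em><em>]</em>) – The measurement inputs</p></li>
+<li><p><strong>results</strong> (<em>List</em><em>[</em><em>MeasureResult</em><em>]</em>) – The measurement results</p></li>
+</ul>
+</dd>
+</dl>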
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.XGBModel.predict">
+<code class="sig-name descname">predict</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">states</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.XGBModel.predict" title="Permalink to this definition">¶</a></dt>
+<dd><p>Predict the scores of states</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The search task of states</p></li>
+<li><p><strong>states</strong> (<em>List</em><em>[</em><em>State</em><em>]</em>) – The input states</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>scores</strong> – The predicted scores for all states</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[<a class="reference external" href="https://docs.python.org/3/library/functions.html#float" title="(in Python v3.8)">float</a>]</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.XGBModel.predict_stages">
+<code class="sig-name descname">predict_stages</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">states</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.XGBModel.predict_stages" title="Permalink to this definition">¶</a></dt>
+<dd><p>Predict the scores of all stages in states. This is the breakdown version of <cite>predict</cite>.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The search task of states</p></li>
+<li><p><strong>states</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>State</em><em>]</em>) – The input states</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>scores</strong> – The predicted scores for all stages in all states in the packed format</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[<a class="reference external" href="https://docs.python.org/3/library/functions.html#float" title="(in Python v3.8)">float</a>]</p>
+</dd>
+</dl>
+<div class="admonition note">
+<p class="admonition-title">Note</p>
+<p>For faster data copy between C++ and Python, the Python part returns scores in a
+single flattened array using a packed format. The C++ part then unpacks the flattened array.
+The packed format is:
+{</p>
 <blockquote>
-<div><p>.                <cite>ProgramBuilder</cite>                 <cite>ProgramRunner</cite>
-<cite>MeasureInput</cite> —————–&gt; <cite>BuildResult</cite> —————-&gt; <cite>MeasureResult</cite></p>
+<div><p>float  scores[N];                 // scores[i] is the score for states[i].
+int    n_stage_0;                 // the number of stages in states[0]
+float  stage_scores_0[n_stage_0]; // the scores for all stages in states[0]
+int    n_stage_1;                 // the number of stages in states[1]
+float  stage_scores_1[n_stage_1]; // the scores for all stages in states[1]
+…
+int    n_stage_i;                 // the number of stages in states[i]
+float  stage_scores_i[n_stage_i]; // the scores for all stages in states[i]
+…  // until i == N - 1</p>
 </div></blockquote>
-<p>We implement these in python to utilize python’s multiprocessing and error handling.</p>
+<p>}
+To implement this format, we also store int as float, so we can store all numbers
+into a single float array.</p>
+</div>
+</dd></dl>
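The packed format in the note above can be illustrated with a small round-trip in plain Python. This is a sketch under assumptions, not the code TVM uses: `pack_scores` and `unpack_scores` are hypothetical helper names, and the real unpacking happens on the C++ side.

```python
def pack_scores(state_scores, stage_scores_per_state):
    """Flatten per-state and per-stage scores into one list of floats.

    Layout: scores[N], then for each state i: n_stage_i, stage_scores_i[n_stage_i].
    Stage counts are stored as floats so everything fits in one float array.
    """
    packed = [float(s) for s in state_scores]
    for stage_scores in stage_scores_per_state:
        packed.append(float(len(stage_scores)))  # int stored as float
        packed.extend(float(s) for s in stage_scores)
    return packed


def unpack_scores(packed, n_states):
    """Inverse of pack_scores: recover per-state and per-stage scores."""
    state_scores = packed[:n_states]
    stage_scores_per_state = []
    pos = n_states
    for _ in range(n_states):
        n_stage = int(packed[pos])  # stage count was stored as a float
        pos += 1
        stage_scores_per_state.append(packed[pos:pos + n_stage])
        pos += n_stage
    return state_scores, stage_scores_per_state


# Two states: the first has two stages, the second has one.
packed = pack_scores([0.5, 1.5], [[0.25, 0.25], [1.5]])
states, stages = unpack_scores(packed, n_states=2)
```

The reader needs to know the number of states `N` up front, since the flat array carries only the per-state stage counts.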
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.XGBModel.update_from_file">
+<code class="sig-name descname">update_from_file</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">file_name</span></em>, <em class="sig-param"><span class="n">n_lines</span><span class="o">=</span><span class="default_value">None</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.XGBModel.update_from_file" title="Permalink to this definition">¶</a></dt>
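+<dd><p>Load measure records from a log file to update the cost model.
+This function can be used to pre-train the cost model with history log files.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>file_name</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The filename</p></li>
+<li><p><strong>n_lines</strong> (<em>Optional</em><em>[</em><a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a><em>]</em>) – Only load the first n lines of the log file</p></li>
+</ul>
+</dd>
+</dl>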
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.XGBModel.save">
+<code class="sig-name descname">save</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">file_name</span><span class="p">:</span> <span class="n"><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)">str</a></span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.XGBModel.save" title="Permalink to this definition">¶</a></dt>
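+<dd><p>Save the model to a file</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>file_name</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The filename</p>
+</dd>
+</dl>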
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.XGBModel.load">
+<code class="sig-name descname">load</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">file_name</span><span class="p">:</span> <span class="n"><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)">str</a></span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.XGBModel.load" title="Permalink to this definition">¶</a></dt>
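+<dd><p>Load the model from a file</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>file_name</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The filename</p>
+</dd>
+</dl>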
+</dd></dl>
+
+</dd></dl>
+
 <dl class="py class">
-<dt id="tvm.auto_scheduler.measure.LocalRPCMeasureContext">
-<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.measure.</code><code class="sig-name descname">LocalRPCMeasureContext</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">priority</span><span class="o">=</span><span class="default_value">1</span></em>, <em class="sig-param"><span class="n">n_parallel</span><span class="o">=</span><span class="default_value">1</span></em>, <em class="sig-param"><span class="n">timeout</span [...]
-<dd><p>A context wrapper for running RPCRunner locally.
-This will launch a local RPC Tracker and local RPC Server.</p>
+<dt id="tvm.auto_scheduler.MeasureInput">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">MeasureInput</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">state</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.MeasureInput" title="Permalink to this definition">¶</a></dt>
+<dd><p>Store the input of a measurement.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The SearchTask of this measurement.</p></li>
+<li><p><strong>state</strong> (<em>Union</em><em>[</em><em>State</em><em>, </em><em>StateObject</em><em>]</em>) – The State to be measured.</p></li>
+</ul>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.MeasureResult">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">MeasureResult</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">costs</span></em>, <em class="sig-param"><span class="n">error_no</span></em>, <em class="sig-param"><span class="n">error_msg</span></em>, <em class="sig-param"><span class="n">all_cost</span></em>, <em class="sig-param"><span class="n">timestamp</span></em><span class=" [...]
+<dd><p>Store the results of a measurement.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>costs</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><a class="reference external" href="https://docs.python.org/3/library/functions.html#float" title="(in Python v3.8)"><em>float</em></a><em>]</em>) – The time costs of execution.</p></li>
+<li><p><strong>error_no</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The error code.</p></li>
+<li><p><strong>error_msg</strong> (<em>Optional</em><em>[</em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em>]</em>) – The error message if there is any error.</p></li>
+<li><p><strong>all_cost</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#float" title="(in Python v3.8)"><em>float</em></a>) – The time cost of build and run.</p></li>
+<li><p><strong>timestamp</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#float" title="(in Python v3.8)"><em>float</em></a>) – The timestamp of this measurement.</p></li>
+</ul>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.LocalBuilder">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">LocalBuilder</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">timeout</span><span class="o">=</span><span class="default_value">15</span></em>, <em class="sig-param"><span class="n">n_parallel</span><span class="o">=</span><span class="default_value">8</span></em>, <em class="sig-param"><span class="n">build_func</span><span class="o [...]
+<dd><p>LocalBuilder uses local CPU cores to build programs in parallel.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>timeout</strong> (<em>int = 15</em>) – The timeout limit (in seconds) for each build thread.
+This is used in a wrapper of the multiprocessing.Process.join().</p></li>
+<li><p><strong>n_parallel</strong> (<em>int = multiprocessing.cpu_count</em><em>(</em><em>)</em>) – Number of threads used to build in parallel.</p></li>
+<li><p><strong>build_func</strong> (<em>str = 'default'</em>) – The name of registered build function.</p></li>
+</ul>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.LocalRunner">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">LocalRunner</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">timeout</span><span class="o">=</span><span class="default_value">10</span></em>, <em class="sig-param"><span class="n">number</span><span class="o">=</span><span class="default_value">3</span></em>, <em class="sig-param"><span class="n">repeat</span><span class="o">=</span [...]
+<dd><p>LocalRunner that uses the local CPU/GPU to measure the time cost of programs.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>priority</strong> (<em>int = 1</em>) – The priority of this run request, larger is more prior.</p></li>
-<li><p><strong>n_parallel</strong> (<em>int = 1</em>) – The number of tasks run in parallel.</p></li>
 <li><p><strong>timeout</strong> (<em>int = 10</em>) – The timeout limit (in seconds) for each run.
 This is used in a wrapper of the multiprocessing.Process.join().</p></li>
 <li><p><strong>number</strong> (<em>int = 3</em>) – The number of times to run the generated code for taking average.
@@ -396,7 +805,7 @@ In total, the generated code will be run (1 + number x repeat) times,
 where the first “1” is warm up and will be discarded.
 The returned result contains <cite>repeat</cite> costs,
 each of which is an average of <cite>number</cite> costs.</p></li>
-<li><p><strong>min_repeat_ms</strong> (<em>int = 0</em>) – The minimum duration of one <cite>repeat</cite> in milliseconds.
+<li><p><strong>min_repeat_ms</strong> (<em>int = 100</em>) – The minimum duration of one <cite>repeat</cite> in milliseconds.
 By default, one <cite>repeat</cite> contains <cite>number</cite> runs. If this parameter is set,
 the parameter <cite>number</cite> will be dynamically adjusted to meet the
 minimum duration requirement of one <cite>repeat</cite>.
@@ -414,12 +823,19 @@ This is only has effect on CPU task.</p></li>
 </dd></dl>
 
 <dl class="py class">
-<dt id="tvm.auto_scheduler.measure.LocalRunner">
-<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.measure.</code><code class="sig-name descname">LocalRunner</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">timeout</span><span class="o">=</span><span class="default_value">10</span></em>, <em class="sig-param"><span class="n">number</span><span class="o">=</span><span class="default_value">3</span></em>, <em class="sig-param"><span class="n">repeat</span><span class="o" [...]
-<dd><p>LocalRunner that uses local CPU/GPU to measures the time cost of programs.</p>
+<dt id="tvm.auto_scheduler.RPCRunner">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">RPCRunner</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">key</span></em>, <em class="sig-param"><span class="n">host</span></em>, <em class="sig-param"><span class="n">port</span></em>, <em class="sig-param"><span class="n">priority</span><span class="o">=</span><span class="default_value">1</span></em>, <em class="sig-param"><span [...]
+<dd><p>RPCRunner that uses RPC calls to measure the time cost of programs on remote devices.
+Sometimes we may also need to use RPC for local runs to isolate the thread environment
+(e.g., when running CUDA programs).</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
+<li><p><strong>key</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The key of the device registered in the RPC tracker.</p></li>
+<li><p><strong>host</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The host address of the RPC Tracker.</p></li>
+<li><p><strong>port</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The port of RPC Tracker.</p></li>
+<li><p><strong>priority</strong> (<em>int = 1</em>) – The priority of this run request; a larger value means higher priority.</p></li>
+<li><p><strong>n_parallel</strong> (<em>int = 1</em>) – The number of tasks run in parallel.</p></li>
 <li><p><strong>timeout</strong> (<em>int = 10</em>) – The timeout limit (in seconds) for each run.
 This is used in a wrapper of the multiprocessing.Process.join().</p></li>
 <li><p><strong>number</strong> (<em>int = 3</em>) – The number of times to run the generated code for taking average.
@@ -447,33 +863,13 @@ This is only has effect on CPU task.</p></li>
 </dd></dl>
 
 <dl class="py class">
-<dt id="tvm.auto_scheduler.measure.LocalBuilder">
-<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.measure.</code><code class="sig-name descname">LocalBuilder</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">timeout</span><span class="o">=</span><span class="default_value">15</span></em>, <em class="sig-param"><span class="n">n_parallel</span><span class="o">=</span><span class="default_value">8</span></em>, <em class="sig-param"><span class="n">build_func</span><span  [...]
-<dd><p>LocalBuilder use local CPU cores to build programs in parallel.</p>
-<dl class="field-list simple">
-<dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><ul class="simple">
-<li><p><strong>timeout</strong> (<em>int = 15</em>) – The timeout limit (in second) for each build thread.
-This is used in a wrapper of the multiprocessing.Process.join().</p></li>
-<li><p><strong>n_parallel</strong> (<em>int = multiprocessing.cpu_count</em><em>(</em><em>)</em>) – Number of threads used to build in parallel.</p></li>
-<li><p><strong>build_func</strong> (<em>str = 'default'</em>) – The name of registered build function.</p></li>
-</ul>
-</dd>
-</dl>
-</dd></dl>
-
-<dl class="py class">
-<dt id="tvm.auto_scheduler.measure.RPCRunner">
-<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.measure.</code><code class="sig-name descname">RPCRunner</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">key</span></em>, <em class="sig-param"><span class="n">host</span></em>, <em class="sig-param"><span class="n">port</span></em>, <em class="sig-param"><span class="n">priority</span><span class="o">=</span><span class="default_value">1</span></em>, <em class="sig-para [...]
-<dd><p>RPCRunner that uses RPC calls to measure the time cost of programs on remote devices.
-Sometimes RPC is also needed for local runs to isolate the thread environment
-(e.g. running CUDA programs).</p>
+<dt id="tvm.auto_scheduler.LocalRPCMeasureContext">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">LocalRPCMeasureContext</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">priority</span><span class="o">=</span><span class="default_value">1</span></em>, <em class="sig-param"><span class="n">n_parallel</span><span class="o">=</span><span class="default_value">1</span></em>, <em class="sig-param"><span class="n">timeout</span><span c [...]
+<dd><p>A context wrapper for running RPCRunner locally.
+This will launch a local RPC Tracker and local RPC Server.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>key</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The key of the device registered in the RPC tracker.</p></li>
-<li><p><strong>host</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The host address of the RPC Tracker.</p></li>
-<li><p><strong>port</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The port of RPC Tracker.</p></li>
 <li><p><strong>priority</strong> (<em>int = 1</em>) – The priority of this run request; a larger value means a higher priority.</p></li>
 <li><p><strong>n_parallel</strong> (<em>int = 1</em>) – The number of tasks run in parallel.</p></li>
 <li><p><strong>timeout</strong> (<em>int = 10</em>) – The timeout limit (in seconds) for each run.
@@ -485,7 +881,7 @@ In total, the generated code will be run (1 + number x repeat) times,
 where the first “1” is warm up and will be discarded.
 The returned result contains <cite>repeat</cite> costs,
 each of which is an average of <cite>number</cite> costs.</p></li>
-<li><p><strong>min_repeat_ms</strong> (<em>int = 100</em>) – The minimum duration of one <cite>repeat</cite> in milliseconds.
+<li><p><strong>min_repeat_ms</strong> (<em>int = 0</em>) – The minimum duration of one <cite>repeat</cite> in milliseconds.
 By default, one <cite>repeat</cite> contains <cite>number</cite> runs. If this parameter is set,
 the parameter <cite>number</cite> will be dynamically adjusted to meet the
 minimum duration requirement of one <cite>repeat</cite>.
@@ -502,7 +898,308 @@ This is only has effect on CPU task.</p></li>
 </dl>
 </dd></dl>
 
+<dl class="py class">
+<dt id="tvm.auto_scheduler.RecordToFile">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">RecordToFile</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">filename</span><span class="o">=</span><span class="default_value">'auto_scheduler_tuning.json'</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.RecordToFile" title="Permalink to this definition">¶</a></dt>
+<dd><p>A measurement callback that writes measurement records into a file.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>filename</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – File name for this callback to write log to.</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.RecordReader">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">RecordReader</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">filename</span><span class="o">=</span><span class="default_value">'auto_scheduler_tuning.json'</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.RecordReader" title="Permalink to this definition">¶</a></dt>
+<dd><p>Reader for the JSON log file.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>filename</strong> (<em>str = &quot;auto_scheduler_tuning.json&quot;</em>) – File name for this reader to load log from.</p>
+</dd>
+</dl>
+<p><strong>Methods</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.RecordReader.read_lines" title="tvm.auto_scheduler.RecordReader.read_lines"><code class="xref py py-obj docutils literal notranslate"><span class="pre">read_lines</span></code></a>([max_lines, skip_lines])</p></td>
+<td><p>Read multiple lines from the log file.</p></td>
+</tr>
+</tbody>
+</table>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.RecordReader.read_lines">
+<code class="sig-name descname">read_lines</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">max_lines</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">skip_lines</span><span class="o">=</span><span class="default_value">0</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.RecordReader.read_lines" title="Permalink to this definition">¶</a></dt>
+<dd><p>Read multiple lines from the log file.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>max_lines</strong> (<em>Optional</em><em>[</em><a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a><em>]</em>) – The maximum number of lines. None to read all lines.</p></li>
+<li><p><strong>skip_lines</strong> (<em>int = 0</em>) – Skip the first n lines.</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><ul class="simple">
+<li><p><strong>inputs</strong> (<em>List[auto_scheduler.measure.MeasureInput]</em>) – The MeasureInputs loaded from the log file.</p></li>
+<li><p><strong>results</strong> (<em>List[auto_scheduler.measure.MeasureResult]</em>) – The MeasureResults loaded from the log file.</p></li>
+</ul>
+</p>
+</dd>
+</dl>
+</dd></dl>
+
+</dd></dl>
+
+<dl class="py function">
+<dt id="tvm.auto_scheduler.load_best">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">load_best</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">filename</span></em>, <em class="sig-param"><span class="n">workload_key</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">target</span><span class="o">=</span><span class="default_value">None</span></em><span class="sig-paren">)</span><a class [...]
+<dd><p>Return the best measurement pair from a log file. This may return None results if
+no valid measurement pair with the specified workload_key/target is found in the log file.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>filename</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – File name to load log from.</p></li>
+<li><p><strong>workload_key</strong> (<em>Optional</em><em>[</em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em>]</em>) – The workload key of the compute declaration.
+With <cite>None</cite>, this returns the best measure pair of all workloads.</p></li>
+<li><p><strong>target</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="target.html#tvm.target.Target" title="tvm.target.Target"><em>tvm.target.Target</em></a><em>]</em>) – The target device.
+With <cite>None</cite>, this returns the best measure pair of all target devices.</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><ul class="simple">
+<li><p><strong>input</strong> (<em>auto_scheduler.measure.MeasureInput</em>) – The best State’s MeasureInput from this log file.</p></li>
+<li><p><strong>result</strong> (<em>auto_scheduler.measure.MeasureResult</em>) – The best State’s MeasureResult from this log file.</p></li>
+</ul>
+</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py function">
+<dt id="tvm.auto_scheduler.load_records">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">load_records</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">filename</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.load_records" title="Permalink to this definition">¶</a></dt>
+<dd><p>Load measurement records from a file.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>filename</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – File name to load log from.</p>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>logs</strong></p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[auto_scheduler.measure.MeasureInput, auto_scheduler.measure.MeasureResult]</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py function">
+<dt id="tvm.auto_scheduler.save_records">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">save_records</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">filename</span></em>, <em class="sig-param"><span class="n">inputs</span></em>, <em class="sig-param"><span class="n">results</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.save_records" title="Permalink to this definition">¶</a></dt>
+<dd><p>Append measurement records to a file.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>filename</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – File name to write log to.</p></li>
+<li><p><strong>inputs</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>MeasureInputs</em><em>]</em>) – The MeasureInputs to be written.</p></li>
+<li><p><strong>results</strong> (<a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>MeasureResults</em><em>]</em>) – The MeasureResults to be written.</p></li>
+</ul>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.EmptyPolicy">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">EmptyPolicy</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">init_search_callbacks</span><span class="o">=</span><span class="default_value">None</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.EmptyPolicy" title="Permalink to this defi [...]
+<dd><p>An example empty search policy that always generates
+the initial state of the ComputeDAG.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The SearchTask for the computation declaration.</p></li>
+<li><p><strong>init_search_callbacks</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>SearchCallback</em><em>]</em><em>]</em>) – Callback functions called before the search process.</p></li>
+</ul>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.SketchPolicy">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">SketchPolicy</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">task</span></em>, <em class="sig-param"><span class="n">program_cost_model</span><span class="o">=</span><span class="default_value">auto_scheduler.RandomModel(52161800)</span></em>, <em class="sig-param"><span class="n">params</span><span class="o">=</span><span class="de [...]
+<dd><p>The search policy that searches in a hierarchical search space defined by sketches.
+The policy randomly samples programs from the space defined by sketches and uses evolutionary
+search to fine-tune them.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>task</strong> (<a class="reference internal" href="#tvm.auto_scheduler.SearchTask" title="tvm.auto_scheduler.SearchTask"><em>SearchTask</em></a>) – The SearchTask for the computation declaration.</p></li>
+<li><p><strong>program_cost_model</strong> (<em>CostModel = RandomModel</em><em>(</em><em>)</em>) – The cost model to estimate the complete schedules.</p></li>
+<li><p><strong>params</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.Dict" title="tvm.relay.dataflow_pattern.Dict"><em>Dict</em></a><em>[</em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em>, </em><a class="reference internal" href="tir.html#tvm.tir.Any" title="tvm.tir.Any"><em>Any</em></a><em>]</em><em>]</em>) – Parameters [...]
+See <cite>src/auto_scheduler/search_policy/sketch_search_policy.h</cite> for the definitions.
+See <cite>DEFAULT_PARAMS</cite> below to find the default values.</p></li>
+<li><p><strong>seed</strong> (<em>Optional</em><em>[</em><a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a><em>]</em>) – Random seed.</p></li>
+<li><p><strong>verbose</strong> (<em>int = 1</em>) – Verbosity level. 0 for silent, 1 to output information during schedule search.</p></li>
+<li><p><strong>init_search_callbacks</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List"><em>List</em></a><em>[</em><em>SearchCallback</em><em>]</em><em>]</em>) – <p>Callback functions called before the search process, usually used to do extra
+initializations.
+Possible callbacks:</p>
+<blockquote>
+<div><ul>
+<li><p>auto_scheduler.PreloadMeasuredStates</p></li>
+<li><p>auto_scheduler.PreloadCustomSketchRule</p></li>
+</ul>
+</div></blockquote>
+<p>TODO(jcf94): Add these search callback implementations.</p>
+</p></li>
+</ul>
+</dd>
+</dl>
+<p><strong>Methods</strong></p>
+<table class="longtable docutils align-default">
+<colgroup>
+<col style="width: 10%" />
+<col style="width: 90%" />
+</colgroup>
+<tbody>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.SketchPolicy.evolutionary_search" title="tvm.auto_scheduler.SketchPolicy.evolutionary_search"><code class="xref py py-obj docutils literal notranslate"><span class="pre">evolutionary_search</span></code></a>(init_populations, out_size)</p></td>
+<td><p>Evolutionary search.</p></td>
+</tr>
+<tr class="row-even"><td><p><a class="reference internal" href="#tvm.auto_scheduler.SketchPolicy.generate_sketches" title="tvm.auto_scheduler.SketchPolicy.generate_sketches"><code class="xref py py-obj docutils literal notranslate"><span class="pre">generate_sketches</span></code></a>([print_for_debug])</p></td>
+<td><p>Generate the sketches.</p></td>
+</tr>
+<tr class="row-odd"><td><p><a class="reference internal" href="#tvm.auto_scheduler.SketchPolicy.sample_initial_population" title="tvm.auto_scheduler.SketchPolicy.sample_initial_population"><code class="xref py py-obj docutils literal notranslate"><span class="pre">sample_initial_population</span></code></a>(pop_size)</p></td>
+<td><p>Sample initial population.</p></td>
+</tr>
+</tbody>
+</table>
+<dl class="py method">
+<dt id="tvm.auto_scheduler.SketchPolicy.generate_sketches">
+<code class="sig-name descname">generate_sketches</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">print_for_debug</span><span class="o">=</span><span class="default_value">False</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.SketchPolicy.generate_sketches" title="Permalink to this definition">¶</a></dt>
+<dd><p>Generate the sketches.
+This Python interface is mainly used for debugging and testing.
+The actual search is all done in C++.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>print_for_debug</strong> (<em>bool = False</em>) – Whether to print out the sketches for debugging.</p>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>sketches</strong> – The generated sketches of this search task.</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[State]</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.SketchPolicy.sample_initial_population">
+<code class="sig-name descname">sample_initial_population</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">pop_size</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.SketchPolicy.sample_initial_population" title="Permalink to this definition">¶</a></dt>
+<dd><p>Sample initial population.
+This Python interface is mainly used for debugging and testing.
+The actual search is all done in C++.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>pop_size</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#int" title="(in Python v3.8)"><em>int</em></a>) – The size of the sampled population</p>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>states</strong> – The sampled states</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[State]</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py method">
+<dt id="tvm.auto_scheduler.SketchPolicy.evolutionary_search">
+<code class="sig-name descname">evolutionary_search</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">init_populations</span></em>, <em class="sig-param"><span class="n">out_size</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.SketchPolicy.evolutionary_search" title="Permalink to this definition">¶</a></dt>
+<dd><p>Evolutionary search.
+This Python interface is mainly used for debugging and testing.
+The actual search is all done in C++.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>init_populations</strong> (<em>List[State]</em>) – The initial population states</p></li>
+<li><p><strong>out_size</strong> (<em>int</em>) – The size of generated states</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>states</strong> – The generated states</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference internal" href="relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.List" title="tvm.relay.dataflow_pattern.List">List</a>[State]</p>
+</dd>
+</dl>
+</dd></dl>
+
+</dd></dl>
+
+<dl class="py class">
+<dt id="tvm.auto_scheduler.PreloadMeasuredStates">
+<em class="property">class </em><code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">PreloadMeasuredStates</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">filename</span><span class="o">=</span><span class="default_value">'auto_scheduler_tuning.json'</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.PreloadMeasuredStates" title="Permalink to this definition">¶</a></dt>
+<dd><p>A SearchCallback to load measured states from the log file for a search policy.</p>
+<dl class="simple">
+<dt>This can resume the state of the search policy:</dt><dd><ul class="simple">
+<li><p>Ensuring that a state already measured in earlier searches is never measured again.</p></li>
+<li><p>The history states can be used to speed up the search process (e.g. SketchPolicy uses
+history states as starting point to perform Evolutionary Search).</p></li>
+</ul>
+</dd>
+</dl>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><p><strong>filename</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – The name of the record file.</p>
+</dd>
+</dl>
+</dd></dl>
+
+<dl class="py function">
+<dt id="tvm.auto_scheduler.register_workload">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">register_workload</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">func_name</span></em>, <em class="sig-param"><span class="n">f</span><span class="o">=</span><span class="default_value">None</span></em>, <em class="sig-param"><span class="n">override</span><span class="o">=</span><span class="default_value">False</span></em><span class="sig-paren">)</span><a clas [...]
+<dd><p>Register a function that generates a certain workload.</p>
+<p>The input function should take hashable and JSON-serializable arguments
+(int, float, tuple of int, tvm.tensor.Tensor, …) and return a list of tvm.tensor.Tensor.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>func_name</strong> (<em>Union</em><em>[</em><a class="reference internal" href="relay/index.html#tvm.relay.Function" title="tvm.relay.Function"><em>Function</em></a><em>, </em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em>]</em>) – The generation function that returns the compute declaration Tensors or its function name.</p></li>
+<li><p><strong>f</strong> (<em>Optional</em><em>[</em><a class="reference internal" href="relay/index.html#tvm.relay.Function" title="tvm.relay.Function"><em>Function</em></a><em>]</em>) – The generation function to be registered.</p></li>
+<li><p><strong>override</strong> (<em>boolean = False</em>) – Whether to override an existing entry.</p></li>
+</ul>
+</dd>
+</dl>
+<p class="rubric">Examples</p>
+<div class="highlight-python notranslate"><div class="highlight"><pre><span class="nd">@auto_scheduler.register_workload</span>
+<span class="k">def</span> <span class="nf">matmul</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">M</span><span class="p">,</span> <span class="n">K</span><span class="p">):</span>
+    <span class="n">A</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">placeholder</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="n">K</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;A&#39;</span><span class="p">)</span>
+    <span class="n">B</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">placeholder</span><span class="p">((</span><span class="n">K</span><span class="p">,</span> <span class="n">M</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;B&#39;</span><span class="p">)</span>
+    <span class="n">k</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">reduce_axis</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="n">K</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;k&#39;</span><span class="p">)</span>
+    <span class="n">C</span> <span class="o">=</span> <span class="n">te</span><span class="o">.</span><span class="n">compute</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="n">M</span><span class="p">),</span> <span class="k">lambda</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">:</span> <span class="n">tvm</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span c [...]
+    <span class="k">return</span> <span class="p">[</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">]</span>
+</pre></div>
 </div>
+</dd></dl>
+
+<dl class="py function">
+<dt id="tvm.auto_scheduler.make_workload_key">
+<code class="sig-prename descclassname">tvm.auto_scheduler.</code><code class="sig-name descname">make_workload_key</code><span class="sig-paren">(</span><em class="sig-param"><span class="n">func</span></em>, <em class="sig-param"><span class="n">args</span></em><span class="sig-paren">)</span><a class="headerlink" href="#tvm.auto_scheduler.make_workload_key" title="Permalink to this definition">¶</a></dt>
+<dd><p>Make a workload key by function and arguments.</p>
+<dl class="field-list simple">
+<dt class="field-odd">Parameters</dt>
+<dd class="field-odd"><ul class="simple">
+<li><p><strong>func</strong> (<em>Union</em><em>[</em><a class="reference internal" href="relay/index.html#tvm.relay.Function" title="tvm.relay.Function"><em>Function</em></a><em>, </em><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em>]</em>) – The function that returns the compute declaration Tensors.
+Can be a function or the function name.</p></li>
+<li><p><strong>args</strong> (<em>Args</em>) – The args of the function.</p></li>
+</ul>
+</dd>
+<dt class="field-even">Returns</dt>
+<dd class="field-even"><p><strong>workload_key</strong> – The workload key of the function.</p>
+</dd>
+<dt class="field-odd">Return type</dt>
+<dd class="field-odd"><p><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)">str</a></p>
+</dd>
+</dl>
+</dd></dl>
+
 </div>
 
 
diff --git a/docs/api/python/autotvm.html b/docs/api/python/autotvm.html
index d7dbc3c..95e171b 100644
--- a/docs/api/python/autotvm.html
+++ b/docs/api/python/autotvm.html
@@ -249,7 +249,7 @@
 <dd><p>Apply the history best config</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>records</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em> or </em><em>iterator of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>M [...]
+<dd class="field-odd"><p><strong>records</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em> or </em><em>iterator of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.Meas [...]
 If it is a str, it should be the filename of a records log file.
 Each row of this file is an encoded record pair. Otherwise, it is an iterator.</p>
 </dd>
@@ -543,7 +543,7 @@ every measurement pair. See autotvm/tuner/callback.py for some examples.</p></li
 <dd><p>load history data for transfer learning</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
+<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
 </dd>
 </dl>
 </dd></dl>
@@ -582,7 +582,7 @@ every measurement pair. See autotvm/tuner/callback.py for some examples.</p></li
 <dd><p>load history data for transfer learning</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
+<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
 </dd>
 </dl>
 </dd></dl>
@@ -688,7 +688,7 @@ every measurement pair. See autotvm/tuner/callback.py for some examples.</p></li
 <dd><p>load history data for transfer learning</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
+<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
 </dd>
 </dl>
 </dd></dl>
@@ -829,7 +829,7 @@ every measurement pair. See autotvm/tuner/callback.py for some examples.</p></li
 <dd><p>load history data for transfer learning</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
+<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
 </dd>
 </dl>
 </dd></dl>
@@ -917,7 +917,7 @@ every measurement pair. See autotvm/tuner/callback.py for some examples.</p></li
 <dd><p>load history data for transfer learning</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
+<dd class="field-odd"><p><strong>data_set</strong> (<em>Array of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a><em>) </em><em>pair</em>) – Previous tuning records</p>
 </dd>
 </dl>
 </dd></dl>
@@ -1852,7 +1852,7 @@ but instead matching by configuration space. The idea is that if two workloads h
 similar configuration space, their optimal configurations are also likely to be similar.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>ref_log</strong> (<em>List of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>MeasureResult</em></a><em>)</em>) – The reference log</p>
+<dd class="field-odd"><p><strong>ref_log</strong> (<em>List of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a><em>)</em>) – The reference log</p>
 </dd>
 </dl>
 </dd></dl>
@@ -1966,7 +1966,7 @@ One can construct a new <cite>ConfigEntity</cite> if this is not the case.</p>
 <dd><p>Apply the history best config</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>records</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em> or </em><em>iterator of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>M [...]
+<dd class="field-odd"><p><strong>records</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em> or </em><em>iterator of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.Meas [...]
 If is str, then it should be the filename of a records log file.
 Each row of this file is an encoded record pair. Otherwise, it is an iterator.</p>
 </dd>
@@ -1977,7 +1977,7 @@ Each row of this file is an encoded record pair. Otherwise, it is an iterator.</
 <dd><p>Load records to this dispatch context</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
-<dd class="field-odd"><p><strong>records</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em> or </em><em>iterator of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>M [...]
+<dd class="field-odd"><p><strong>records</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a><em> or </em><em>iterator of</em><em> (</em><a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a><em>, </em><a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.Meas [...]
 If is str, then it should be the filename of a records log file.
 Each row of this file is an encoded record pair. Otherwise, it is an iterator.</p>
 </dd>
@@ -2317,7 +2317,7 @@ If is callable, decorate this function.</p></li>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>inp</strong> (<a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>MeasureInput</em></a>) – input for the measure</p></li>
+<li><p><strong>inp</strong> (<a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a>) – input for the measure</p></li>
 <li><p><strong>include_config</strong> (<a class="reference external" href="https://docs.python.org/3/library/functions.html#bool" title="(in Python v3.8)"><em>bool</em></a><em>, </em><em>optional</em>) – whether includes config in the str key</p></li>
 </ul>
 </dd>
@@ -2337,8 +2337,8 @@ If is callable, decorate this function.</p></li>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
-<li><p><strong>inp</strong> (<em>autotvm.tuner.MeasureInput</em>) – </p></li>
-<li><p><strong>result</strong> (<em>autotvm.tuner.MeasureResult</em>) – pair of input/result</p></li>
+<li><p><strong>inp</strong> (<a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput"><em>autotvm.measure.MeasureInput</em></a>) – </p></li>
+<li><p><strong>result</strong> (<a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult"><em>autotvm.measure.MeasureResult</em></a>) – pair of input/result</p></li>
 <li><p><strong>protocol</strong> (<a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#str" title="(in Python v3.8)"><em>str</em></a>) – log protocol, json or pickle</p></li>
 </ul>
 </dd>
@@ -2366,7 +2366,7 @@ If is callable, decorate this function.</p></li>
 <dd class="field-even"><p><strong>ret</strong> – The tuple of input and result, or None if input uses old version log format.</p>
 </dd>
 <dt class="field-odd">Return type</dt>
-<dd class="field-odd"><p><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#tuple" title="(in Python v3.8)">tuple</a>(autotvm.tuner.MeasureInput, autotvm.tuner.MeasureResult), or <a class="reference external" href="https://docs.python.org/3/library/constants.html#None" title="(in Python v3.8)">None</a></p>
+<dd class="field-odd"><p><a class="reference external" href="https://docs.python.org/3/library/stdtypes.html#tuple" title="(in Python v3.8)">tuple</a>(<a class="reference internal" href="#tvm.autotvm.measure.MeasureInput" title="tvm.autotvm.measure.MeasureInput">autotvm.measure.MeasureInput</a>, <a class="reference internal" href="#tvm.autotvm.measure.MeasureResult" title="tvm.autotvm.measure.MeasureResult">autotvm.measure.MeasureResult</a>), or <a class="reference external" href="https: [...]
 </dd>
 </dl>
 </dd></dl>
@@ -2382,8 +2382,8 @@ This is a generator that yields the records.</p>
 </dd>
 <dt class="field-even">Yields</dt>
 <dd class="field-even"><ul class="simple">
-<li><p><strong>input</strong> (<em>autotvm.tuner.MeasureInput</em>)</p></li>
-<li><p><strong>result</strong> (<em>autotvm.tuner.MeasureResult</em>)</p></li>
+<li><p><strong>input</strong> (<em>autotvm.measure.MeasureInput</em>)</p></li>
+<li><p><strong>result</strong> (<em>autotvm.measure.MeasureResult</em>)</p></li>
 </ul>
 </dd>
 </dl>
diff --git a/docs/api/python/index.html b/docs/api/python/index.html
index 6a93241..960f9a3 100644
--- a/docs/api/python/index.html
+++ b/docs/api/python/index.html
@@ -259,12 +259,7 @@
 <li class="toctree-l2"><a class="reference internal" href="autotvm.html#module-tvm.autotvm.record">tvm.autotvm.record</a></li>
 </ul>
 </li>
-<li class="toctree-l1"><a class="reference internal" href="auto_scheduler.html">tvm.auto_scheduler</a><ul>
-<li class="toctree-l2"><a class="reference internal" href="auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule">tvm.auto_scheduler.auto_schedule</a></li>
-<li class="toctree-l2"><a class="reference internal" href="auto_scheduler.html#tvm-auto-scheduler-workload-registry">tvm.auto_scheduler.workload_registry</a></li>
-<li class="toctree-l2"><a class="reference internal" href="auto_scheduler.html#module-tvm.auto_scheduler.measure">tvm.auto_scheduler.measure</a></li>
-</ul>
-</li>
+<li class="toctree-l1"><a class="reference internal" href="auto_scheduler.html">tvm.auto_scheduler</a></li>
 <li class="toctree-l1"><a class="reference internal" href="rpc.html">tvm.rpc</a></li>
 <li class="toctree-l1"><a class="reference internal" href="micro.html">tvm.micro</a></li>
 <li class="toctree-l1"><a class="reference internal" href="contrib.html">tvm.contrib</a><ul>
diff --git a/docs/api/typedoc/classes/bytestreamreader.html b/docs/api/typedoc/classes/bytestreamreader.html
index 912f5ba..b3631df 100644
--- a/docs/api/typedoc/classes/bytestreamreader.html
+++ b/docs/api/typedoc/classes/bytestreamreader.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -141,7 +141,7 @@
 					<div class="tsd-signature tsd-kind-icon">bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Uint8Array</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L43">rpc_server.ts:43</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -151,7 +151,7 @@
 					<div class="tsd-signature tsd-kind-icon">offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L42">rpc_server.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -168,7 +168,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L63">rpc_server.ts:63</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">Uint8Array</span></h4>
@@ -185,7 +185,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L49">rpc_server.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -202,7 +202,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L57">rpc_server.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/api/typedoc/classes/cachedcallstack.html b/docs/api/typedoc/classes/cachedcallstack.html
index 7f98c54..35452af 100644
--- a/docs/api/typedoc/classes/cachedcallstack.html
+++ b/docs/api/typedoc/classes/cachedcallstack.html
@@ -144,7 +144,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L223">memory.ts:223</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L223">memory.ts:223</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -172,7 +172,7 @@
 					<div class="tsd-signature tsd-kind-icon">temp<wbr>Args<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><a href="../interfaces/disposable.html" class="tsd-signature-type">Disposable</a><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L208">memory.ts:208</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L208">memory.ts:208</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -194,7 +194,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L312">memory.ts:312</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L312">memory.ts:312</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L284">memory.ts:284</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L284">memory.ts:284</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -262,7 +262,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L388">memory.ts:388</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L388">memory.ts:388</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -300,7 +300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L376">memory.ts:376</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L376">memory.ts:376</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -340,7 +340,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L267">memory.ts:267</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L267">memory.ts:267</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -373,7 +373,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L243">memory.ts:243</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L243">memory.ts:243</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -390,7 +390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L321">memory.ts:321</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L321">memory.ts:321</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -422,7 +422,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L252">memory.ts:252</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L252">memory.ts:252</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -444,7 +444,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L359">memory.ts:359</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L359">memory.ts:359</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -470,7 +470,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L342">memory.ts:342</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L342">memory.ts:342</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -496,7 +496,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L350">memory.ts:350</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L350">memory.ts:350</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -522,7 +522,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L326">memory.ts:326</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L326">memory.ts:326</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -548,7 +548,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L363">memory.ts:363</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L363">memory.ts:363</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -574,7 +574,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L346">memory.ts:346</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L346">memory.ts:346</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -600,7 +600,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L334">memory.ts:334</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L334">memory.ts:334</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
diff --git a/docs/api/typedoc/classes/dlcontext.html b/docs/api/typedoc/classes/dlcontext.html
index c7712be..0826c18 100644
--- a/docs/api/typedoc/classes/dlcontext.html
+++ b/docs/api/typedoc/classes/dlcontext.html
@@ -118,7 +118,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L203">runtime.ts:203</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L203">runtime.ts:203</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L201">runtime.ts:201</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L201">runtime.ts:201</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -161,7 +161,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L199">runtime.ts:199</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L199">runtime.ts:199</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -183,7 +183,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L224">runtime.ts:224</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L224">runtime.ts:224</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -205,7 +205,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L231">runtime.ts:231</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L231">runtime.ts:231</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/api/typedoc/classes/dldatatype.html b/docs/api/typedoc/classes/dldatatype.html
index e2e75a5..9398a12 100644
--- a/docs/api/typedoc/classes/dldatatype.html
+++ b/docs/api/typedoc/classes/dldatatype.html
@@ -119,7 +119,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L263">runtime.ts:263</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L263">runtime.ts:263</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">bits<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L261">runtime.ts:261</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L261">runtime.ts:261</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">code<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L259">runtime.ts:259</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L259">runtime.ts:259</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -177,7 +177,7 @@
 					<div class="tsd-signature tsd-kind-icon">lanes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L263">runtime.ts:263</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L263">runtime.ts:263</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -199,7 +199,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L280">runtime.ts:280</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L280">runtime.ts:280</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -216,7 +216,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L271">runtime.ts:271</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L271">runtime.ts:271</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">string</span></h4>
diff --git a/docs/api/typedoc/classes/environment.html b/docs/api/typedoc/classes/environment.html
index f0e6355..63340ea 100644
--- a/docs/api/typedoc/classes/environment.html
+++ b/docs/api/typedoc/classes/environment.html
@@ -125,7 +125,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L86">environment.ts:86</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L86">environment.ts:86</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -169,7 +169,7 @@
 					<aside class="tsd-sources">
 						<p>Implementation of <a href="../interfaces/libraryprovider.html">LibraryProvider</a>.<a href="../interfaces/libraryprovider.html#imports">imports</a></p>
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L70">environment.ts:70</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L70">environment.ts:70</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L69">environment.ts:69</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L69">environment.ts:69</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -210,7 +210,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">ctypes.FTVMWasmPackedCFunc</span><span class="tsd-signature-symbol"> | </span><span class="tsd-signature-type">undefined</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = [undefined,]</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L78">environment.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L78">environment.ts:78</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -228,7 +228,7 @@
 					<div class="tsd-signature tsd-kind-icon">packedCFunc<wbr>Table<wbr>Free<wbr>Id<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span><span class="tsd-signature-symbol"> = []</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L84">environment.ts:84</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L84">environment.ts:84</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -250,7 +250,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L105">environment.ts:105</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L105">environment.ts:105</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/ffilibrary.html b/docs/api/typedoc/classes/ffilibrary.html
index dc6637d..22c6576 100644
--- a/docs/api/typedoc/classes/ffilibrary.html
+++ b/docs/api/typedoc/classes/ffilibrary.html
@@ -131,7 +131,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L49">runtime.ts:49</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L49">runtime.ts:49</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L46">runtime.ts:46</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L46">runtime.ts:46</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L45">runtime.ts:45</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L45">runtime.ts:45</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L44">runtime.ts:44</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L44">runtime.ts:44</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">webGPUContext<span class="tsd-signature-symbol">:</span> <a href="webgpucontext.html" class="tsd-signature-type">WebGPUContext</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L47">runtime.ts:47</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L47">runtime.ts:47</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -203,7 +203,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L76">runtime.ts:76</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L76">runtime.ts:76</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -226,7 +226,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L66">runtime.ts:66</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L66">runtime.ts:66</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -243,7 +243,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L84">runtime.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L84">runtime.ts:84</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <a href="cachedcallstack.html" class="tsd-signature-type">CachedCallStack</a></h4>
@@ -260,7 +260,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L95">runtime.ts:95</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L95">runtime.ts:95</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -283,7 +283,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L72">runtime.ts:72</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L72">runtime.ts:72</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
diff --git a/docs/api/typedoc/classes/graphruntime.html b/docs/api/typedoc/classes/graphruntime.html
index 8ec806e..fdc3401 100644
--- a/docs/api/typedoc/classes/graphruntime.html
+++ b/docs/api/typedoc/classes/graphruntime.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L584">runtime.ts:584</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L584">runtime.ts:584</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">module<span class="tsd-signature-symbol">:</span> <a href="module.html" class="tsd-signature-type">Module</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L580">runtime.ts:580</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L580">runtime.ts:580</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L655">runtime.ts:655</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L655">runtime.ts:655</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -224,7 +224,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L598">runtime.ts:598</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L598">runtime.ts:598</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -241,7 +241,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L632">runtime.ts:632</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L632">runtime.ts:632</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L645">runtime.ts:645</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L645">runtime.ts:645</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -310,7 +310,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L622">runtime.ts:622</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L622">runtime.ts:622</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -332,7 +332,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L610">runtime.ts:610</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L610">runtime.ts:610</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/instance.html b/docs/api/typedoc/classes/instance.html
index c026d07..b688d79 100644
--- a/docs/api/typedoc/classes/instance.html
+++ b/docs/api/typedoc/classes/instance.html
@@ -139,7 +139,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L693">runtime.ts:693</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L693">runtime.ts:693</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -202,7 +202,7 @@
 					<div class="tsd-signature tsd-kind-icon">exports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">Function</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L685">runtime.ts:685</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L685">runtime.ts:685</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -212,7 +212,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L684">runtime.ts:684</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L684">runtime.ts:684</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -229,7 +229,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L925">runtime.ts:925</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L925">runtime.ts:925</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -267,7 +267,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L933">runtime.ts:933</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L933">runtime.ts:933</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -298,7 +298,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L995">runtime.ts:995</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L995">runtime.ts:995</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -341,7 +341,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L733">runtime.ts:733</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L733">runtime.ts:733</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -358,7 +358,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L953">runtime.ts:953</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L953">runtime.ts:953</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -402,7 +402,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L817">runtime.ts:817</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L817">runtime.ts:817</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -434,7 +434,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L1038">runtime.ts:1038</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L1038">runtime.ts:1038</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -465,7 +465,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L847">runtime.ts:847</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L847">runtime.ts:847</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -497,7 +497,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L751">runtime.ts:751</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L751">runtime.ts:751</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -520,7 +520,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L1018">runtime.ts:1018</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L1018">runtime.ts:1018</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -568,7 +568,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L790">runtime.ts:790</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L790">runtime.ts:790</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -608,7 +608,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L915">runtime.ts:915</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L915">runtime.ts:915</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -646,7 +646,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L1139">runtime.ts:1139</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L1139">runtime.ts:1139</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -698,7 +698,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L741">runtime.ts:741</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L741">runtime.ts:741</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -722,7 +722,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L869">runtime.ts:869</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L869">runtime.ts:869</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -754,7 +754,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L858">runtime.ts:858</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L858">runtime.ts:858</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -786,7 +786,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L941">runtime.ts:941</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L941">runtime.ts:941</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/memory.html b/docs/api/typedoc/classes/memory.html
index 3d6b3dc..666b0eb 100644
--- a/docs/api/typedoc/classes/memory.html
+++ b/docs/api/typedoc/classes/memory.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L40">memory.ts:40</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L40">memory.ts:40</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Memory</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L32">memory.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L32">memory.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -162,7 +162,7 @@
 					<div class="tsd-signature tsd-kind-icon">wasm32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">boolean</span><span class="tsd-signature-symbol"> = true</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L33">memory.ts:33</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L33">memory.ts:33</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -179,7 +179,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L154">memory.ts:154</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L154">memory.ts:154</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -210,7 +210,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L90">memory.ts:90</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L90">memory.ts:90</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -233,7 +233,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L97">memory.ts:97</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L97">memory.ts:97</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -256,7 +256,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L74">memory.ts:74</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L74">memory.ts:74</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -279,7 +279,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L81">memory.ts:81</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L81">memory.ts:81</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -302,7 +302,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L104">memory.ts:104</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L104">memory.ts:104</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -325,7 +325,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L132">memory.ts:132</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L132">memory.ts:132</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -362,7 +362,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L145">memory.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L145">memory.ts:145</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -393,7 +393,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L60">memory.ts:60</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L60">memory.ts:60</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -416,7 +416,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L67">memory.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L67">memory.ts:67</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -439,7 +439,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L53">memory.ts:53</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L53">memory.ts:53</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -462,7 +462,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L114">memory.ts:114</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L114">memory.ts:114</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -485,7 +485,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L124">memory.ts:124</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L124">memory.ts:124</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">number</span></h4>
@@ -502,7 +502,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/memory.ts#L175">memory.ts:175</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/memory.ts#L175">memory.ts:175</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/module.html b/docs/api/typedoc/classes/module.html
index 223e4fa..2a6ef10 100644
--- a/docs/api/typedoc/classes/module.html
+++ b/docs/api/typedoc/classes/module.html
@@ -124,7 +124,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L505">runtime.ts:505</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L505">runtime.ts:505</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L503">runtime.ts:503</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L503">runtime.ts:503</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -187,7 +187,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L517">runtime.ts:517</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L517">runtime.ts:517</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -204,7 +204,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L531">runtime.ts:531</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L531">runtime.ts:531</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -236,7 +236,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L562">runtime.ts:562</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L562">runtime.ts:562</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/ndarray.html b/docs/api/typedoc/classes/ndarray.html
index 5a9c537..98d5354 100644
--- a/docs/api/typedoc/classes/ndarray.html
+++ b/docs/api/typedoc/classes/ndarray.html
@@ -130,7 +130,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L305">runtime.ts:305</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L305">runtime.ts:305</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -158,7 +158,7 @@
 					<div class="tsd-signature tsd-kind-icon">context<span class="tsd-signature-symbol">:</span> <a href="dlcontext.html" class="tsd-signature-type">DLContext</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L298">runtime.ts:298</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L298">runtime.ts:298</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -173,7 +173,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L294">runtime.ts:294</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L294">runtime.ts:294</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -188,7 +188,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L290">runtime.ts:290</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L290">runtime.ts:290</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -203,7 +203,7 @@
 					<div class="tsd-signature tsd-kind-icon">ndim<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L292">runtime.ts:292</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L292">runtime.ts:292</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -218,7 +218,7 @@
 					<div class="tsd-signature tsd-kind-icon">shape<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L296">runtime.ts:296</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L296">runtime.ts:296</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -240,7 +240,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L371">runtime.ts:371</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L371">runtime.ts:371</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -273,7 +273,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L415">runtime.ts:415</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L415">runtime.ts:415</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -305,7 +305,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L356">runtime.ts:356</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L356">runtime.ts:356</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
@@ -322,7 +322,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L475">runtime.ts:475</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L475">runtime.ts:475</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -346,7 +346,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L444">runtime.ts:444</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L444">runtime.ts:444</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/packedfunccell.html b/docs/api/typedoc/classes/packedfunccell.html
index c9f2ee8..cfdd04f 100644
--- a/docs/api/typedoc/classes/packedfunccell.html
+++ b/docs/api/typedoc/classes/packedfunccell.html
@@ -122,7 +122,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L158">runtime.ts:158</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L158">runtime.ts:158</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -147,7 +147,7 @@
 					<div class="tsd-signature tsd-kind-icon">handle<span class="tsd-signature-symbol">:</span> <a href="../index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L157">runtime.ts:157</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L157">runtime.ts:157</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -164,7 +164,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L165">runtime.ts:165</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L165">runtime.ts:165</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">void</span></h4>
diff --git a/docs/api/typedoc/classes/rpcserver.html b/docs/api/typedoc/classes/rpcserver.html
index 65281e2..844ce5b 100644
--- a/docs/api/typedoc/classes/rpcserver.html
+++ b/docs/api/typedoc/classes/rpcserver.html
@@ -115,7 +115,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L92">rpc_server.ts:92</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">get<wbr>Imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">unknown</span><span class="tsd-signat [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L82">rpc_server.ts:82</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -201,7 +201,7 @@
 					<div class="tsd-signature tsd-kind-icon">key<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L78">rpc_server.ts:78</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -211,7 +211,7 @@
 					<div class="tsd-signature tsd-kind-icon">logger<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>msg<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L81">rpc_server.ts:81</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-type-declaration">
@@ -242,7 +242,7 @@
 					<div class="tsd-signature tsd-kind-icon">socket<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">WebSocket</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L79">rpc_server.ts:79</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -252,7 +252,7 @@
 					<div class="tsd-signature tsd-kind-icon">state<span class="tsd-signature-symbol">:</span> <a href="../enums/rpcserverstate.html" class="tsd-signature-type">RPCServerState</a><span class="tsd-signature-symbol"> = RPCServerState.InitHeader</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L80">rpc_server.ts:80</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -262,7 +262,7 @@
 					<div class="tsd-signature tsd-kind-icon">url<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L77">rpc_server.ts:77</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/classes/scalar.html b/docs/api/typedoc/classes/scalar.html
index c43692c..0a8c77a 100644
--- a/docs/api/typedoc/classes/scalar.html
+++ b/docs/api/typedoc/classes/scalar.html
@@ -112,7 +112,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -137,7 +137,7 @@
 					<div class="tsd-signature tsd-kind-icon">dtype<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L145">runtime.ts:145</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L145">runtime.ts:145</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -152,7 +152,7 @@
 					<div class="tsd-signature tsd-kind-icon">value<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L143">runtime.ts:143</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L143">runtime.ts:143</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/classes/webgpucontext.html b/docs/api/typedoc/classes/webgpucontext.html
index 86e4b97..7f7f81b 100644
--- a/docs/api/typedoc/classes/webgpucontext.html
+++ b/docs/api/typedoc/classes/webgpucontext.html
@@ -120,7 +120,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L57">webgpu.ts:57</a></li>
 								</ul>
 							</aside>
 							<h4 class="tsd-parameters-title">Parameters</h4>
@@ -145,7 +145,7 @@
 					<div class="tsd-signature tsd-kind-icon">device<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">GPUDevice</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L50">webgpu.ts:50</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -155,7 +155,7 @@
 					<div class="tsd-signature tsd-kind-icon">memory<span class="tsd-signature-symbol">:</span> <a href="memory.html" class="tsd-signature-type">Memory</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L51">webgpu.ts:51</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -172,7 +172,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L84">webgpu.ts:84</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -209,7 +209,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L170">webgpu.ts:170</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L67">webgpu.ts:67</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/enums/argtypecode.html b/docs/api/typedoc/enums/argtypecode.html
index aa8e1f9..371adef 100644
--- a/docs/api/typedoc/enums/argtypecode.html
+++ b/docs/api/typedoc/enums/argtypecode.html
@@ -106,7 +106,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L216">ctypes.ts:216</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -116,7 +116,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L214">ctypes.ts:214</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -126,7 +126,7 @@
 					<div class="tsd-signature tsd-kind-icon">Null<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L218">ctypes.ts:218</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -136,7 +136,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 12</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L226">ctypes.ts:226</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -146,7 +146,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMContext<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 6</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L220">ctypes.ts:220</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -156,7 +156,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMDLTensor<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 7</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L221">ctypes.ts:221</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -166,7 +166,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L219">ctypes.ts:219</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -176,7 +176,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMModule<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 9</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L223">ctypes.ts:223</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -186,7 +186,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMNDArray<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 13</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L227">ctypes.ts:227</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -196,7 +196,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObject<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L222">ctypes.ts:222</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -206,7 +206,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMObjectRValue<wbr>Ref<wbr>Arg<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 14</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L228">ctypes.ts:228</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -216,7 +216,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMOpaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L217">ctypes.ts:217</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -226,7 +226,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMPacked<wbr>Func<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 10</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L224">ctypes.ts:224</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -236,7 +236,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 11</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L225">ctypes.ts:225</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -246,7 +246,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L215">ctypes.ts:215</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/enums/aynccallbackcode.html b/docs/api/typedoc/enums/aynccallbackcode.html
index 093c8b9..80c4dc4 100644
--- a/docs/api/typedoc/enums/aynccallbackcode.html
+++ b/docs/api/typedoc/enums/aynccallbackcode.html
@@ -93,7 +93,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Exception<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 5</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L677">runtime.ts:677</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L677">runtime.ts:677</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -103,7 +103,7 @@
 					<div class="tsd-signature tsd-kind-icon">k<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L676">runtime.ts:676</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L676">runtime.ts:676</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/enums/dldatatypecode.html b/docs/api/typedoc/enums/dldatatypecode.html
index d48c95c..f3ab6cc 100644
--- a/docs/api/typedoc/enums/dldatatypecode.html
+++ b/docs/api/typedoc/enums/dldatatypecode.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">Float<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L243">runtime.ts:243</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L243">runtime.ts:243</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">Int<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 0</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L241">runtime.ts:241</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L241">runtime.ts:241</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">Opaque<wbr>Handle<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 3</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L244">runtime.ts:244</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L244">runtime.ts:244</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -125,7 +125,7 @@
 					<div class="tsd-signature tsd-kind-icon">UInt<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L242">runtime.ts:242</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L242">runtime.ts:242</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/enums/rpcserverstate.html b/docs/api/typedoc/enums/rpcserverstate.html
index bcbcfe3..da655ba 100644
--- a/docs/api/typedoc/enums/rpcserverstate.html
+++ b/docs/api/typedoc/enums/rpcserverstate.html
@@ -90,7 +90,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L27">rpc_server.ts:27</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Header<wbr>Key<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L28">rpc_server.ts:28</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">Init<wbr>Server<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L29">rpc_server.ts:29</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Body<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L32">rpc_server.ts:32</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">Receive<wbr>Packet<wbr>Header<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L31">rpc_server.ts:31</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">Wait<wbr>For<wbr>Callback<span class="tsd-signature-symbol">:</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L30">rpc_server.ts:30</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/enums/sizeof.html b/docs/api/typedoc/enums/sizeof.html
index ae14170..7e338d2 100644
--- a/docs/api/typedoc/enums/sizeof.html
+++ b/docs/api/typedoc/enums/sizeof.html
@@ -100,7 +100,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLContext<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32 + I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L207">ctypes.ts:207</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -110,7 +110,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = I32</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L206">ctypes.ts:206</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -120,7 +120,7 @@
 					<div class="tsd-signature tsd-kind-icon">F32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L203">ctypes.ts:203</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -130,7 +130,7 @@
 					<div class="tsd-signature tsd-kind-icon">F64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L204">ctypes.ts:204</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -140,7 +140,7 @@
 					<div class="tsd-signature tsd-kind-icon">I32<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 4</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L201">ctypes.ts:201</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -150,7 +150,7 @@
 					<div class="tsd-signature tsd-kind-icon">I64<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L202">ctypes.ts:202</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -160,7 +160,7 @@
 					<div class="tsd-signature tsd-kind-icon">TVMValue<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 8</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L205">ctypes.ts:205</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -170,7 +170,7 @@
 					<div class="tsd-signature tsd-kind-icon">U16<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 2</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L200">ctypes.ts:200</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -180,7 +180,7 @@
 					<div class="tsd-signature tsd-kind-icon">U8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol"> = 1</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L199">ctypes.ts:199</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/index.html b/docs/api/typedoc/index.html
index 3ec2328..33a7936 100644
--- a/docs/api/typedoc/index.html
+++ b/docs/api/typedoc/index.html
@@ -174,7 +174,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Alloc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>shape<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, ndim<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeCode<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, dtypeBits<span class="tsd [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L112">ctypes.ts:112</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -238,7 +238,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>Bytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">num [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L128">ctypes.ts:128</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -282,7 +282,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>From<wbr>To<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>from<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, to<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-sig [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L144">ctypes.ts:144</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -326,7 +326,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Copy<wbr>ToBytes<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, data<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nbytes<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</sp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L136">ctypes.ts:136</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -370,7 +370,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMArray<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>handle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L121">ctypes.ts:121</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -406,7 +406,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMBackend<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number< [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L160">ctypes.ts:160</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -458,7 +458,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCFunc<wbr>Set<wbr>Return<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ret<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L77">ctypes.ts:77</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -506,7 +506,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMCb<wbr>Arg<wbr>ToReturn<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>value<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, code<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span c [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L83">ctypes.ts:83</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -545,7 +545,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Call<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, argValues<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCode<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-t [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L67">ctypes.ts:67</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -601,7 +601,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>func<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L57">ctypes.ts:57</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -637,7 +637,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Get<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span cla [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L100">ctypes.ts:100</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -676,7 +676,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>List<wbr>Global<wbr>Names<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>outSize<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, outArray<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L88">ctypes.ts:88</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -715,7 +715,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMFunc<wbr>Register<wbr>Global<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>name<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, f<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, override<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</spa [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L94">ctypes.ts:94</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -758,7 +758,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMGet<wbr>Last<wbr>Error<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L34">ctypes.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -788,7 +788,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Free<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L52">ctypes.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -824,7 +824,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Get<wbr>Function<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, funcName<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, queryImports<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">numbe [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L42">ctypes.ts:42</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -872,7 +872,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMMod<wbr>Import<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>mod<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, dep<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-si [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L48">ctypes.ts:48</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -912,7 +912,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMSynchronize<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>deviceType<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, deviceId<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, stream<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signatur [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L150">ctypes.ts:150</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -954,7 +954,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Alloc<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>size<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L167">ctypes.ts:167</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -990,7 +990,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Free<wbr>Space<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>ptr<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L170">ctypes.ts:170</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1026,7 +1026,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>Func<wbr>Create<wbr>FromCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resource<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, out<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&g [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L187">ctypes.ts:187</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1066,7 +1066,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>args<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, typeCodes<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a>, nargs<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">number</span>, [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L179">ctypes.ts:179</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1118,7 +1118,7 @@
 					<div class="tsd-signature tsd-kind-icon">FTVMWasm<wbr>PackedCFunc<wbr>Finalizer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>resourceHandle<span class="tsd-signature-symbol">: </span><a href="index.html#pointer" class="tsd-signature-type">Pointer</a><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L193">ctypes.ts:193</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1154,7 +1154,7 @@
 					<div class="tsd-signature tsd-kind-icon">GPUPointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L25">webgpu.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1169,7 +1169,7 @@
 					<div class="tsd-signature tsd-kind-icon">Packed<wbr>Func<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">...</span>args<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol"> &amp; </span><a href="interfaces/disp [...]
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L36">runtime.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L36">runtime.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1184,7 +1184,7 @@
 					<div class="tsd-signature tsd-kind-icon">Pointer<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L25">ctypes.ts:25</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1199,7 +1199,7 @@
 					<div class="tsd-signature tsd-kind-icon">Ptr<wbr>Offset<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/ctypes.ts#L28">ctypes.ts:28</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1217,7 +1217,7 @@
 					<div class="tsd-signature tsd-kind-icon">RPC_<wbr>MAGIC<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">1045105</span><span class="tsd-signature-symbol"> = 1045105</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/rpc_server.ts#L36">rpc_server.ts:36</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -1239,7 +1239,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/support.ts#L25">support.ts:25</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/support.ts#L25">support.ts:25</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1271,7 +1271,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/support.ts#L39">support.ts:39</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/support.ts#L39">support.ts:39</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1300,7 +1300,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/support.ts#L52">support.ts:52</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/support.ts#L52">support.ts:52</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1337,7 +1337,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/compact.ts#L38">compact.ts:38</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/compact.ts#L38">compact.ts:38</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1368,7 +1368,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L30">webgpu.ts:30</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1390,7 +1390,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/environment.ts#L32">environment.ts:32</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/environment.ts#L32">environment.ts:32</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1421,7 +1421,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/compact.ts#L24">compact.ts:24</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/compact.ts#L24">compact.ts:24</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1443,7 +1443,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L1360">runtime.ts:1360</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L1360">runtime.ts:1360</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1508,7 +1508,7 @@
 						<li class="tsd-description">
 							<aside class="tsd-sources">
 								<ul>
-									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/support.ts#L62">support.ts:62</a></li>
+									<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/support.ts#L62">support.ts:62</a></li>
 								</ul>
 							</aside>
 							<div class="tsd-comment tsd-typography">
@@ -1530,7 +1530,7 @@
 					<div class="tsd-signature tsd-kind-icon">DLData<wbr>Type<wbr>Code<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L247">runtime.ts:247</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L247">runtime.ts:247</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1539,7 +1539,7 @@
 						<div class="tsd-signature tsd-kind-icon">0<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;int&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L248">runtime.ts:248</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L248">runtime.ts:248</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1549,7 +1549,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;uint&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L249">runtime.ts:249</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L249">runtime.ts:249</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1559,7 +1559,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;float&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L250">runtime.ts:250</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L250">runtime.ts:250</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1569,7 +1569,7 @@
 						<div class="tsd-signature tsd-kind-icon">3<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;handle&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L251">runtime.ts:251</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L251">runtime.ts:251</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1580,7 +1580,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Enum<wbr>ToStr<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L175">runtime.ts:175</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L175">runtime.ts:175</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1589,7 +1589,7 @@
 						<div class="tsd-signature tsd-kind-icon">1<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;cpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L176">runtime.ts:176</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L176">runtime.ts:176</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1599,7 +1599,7 @@
 						<div class="tsd-signature tsd-kind-icon">15<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;webgpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L180">runtime.ts:180</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L180">runtime.ts:180</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1609,7 +1609,7 @@
 						<div class="tsd-signature tsd-kind-icon">2<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;gpu&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L177">runtime.ts:177</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L177">runtime.ts:177</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1619,7 +1619,7 @@
 						<div class="tsd-signature tsd-kind-icon">4<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;opencl&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L178">runtime.ts:178</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L178">runtime.ts:178</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1629,7 +1629,7 @@
 						<div class="tsd-signature tsd-kind-icon">8<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span><span class="tsd-signature-symbol"> = &quot;metal&quot;</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L179">runtime.ts:179</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L179">runtime.ts:179</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1640,7 +1640,7 @@
 					<div class="tsd-signature tsd-kind-icon">Device<wbr>Str<wbr>ToEnum<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">object</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L183">runtime.ts:183</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L183">runtime.ts:183</a></li>
 						</ul>
 					</aside>
 					<section class="tsd-panel tsd-member tsd-kind-variable tsd-parent-kind-object-literal">
@@ -1649,7 +1649,7 @@
 						<div class="tsd-signature tsd-kind-icon">cl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L187">runtime.ts:187</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L187">runtime.ts:187</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1659,7 +1659,7 @@
 						<div class="tsd-signature tsd-kind-icon">cpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 1</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L184">runtime.ts:184</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L184">runtime.ts:184</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1669,7 +1669,7 @@
 						<div class="tsd-signature tsd-kind-icon">cuda<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 2</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L186">runtime.ts:186</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L186">runtime.ts:186</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1679,7 +1679,7 @@
 						<div class="tsd-signature tsd-kind-icon">gpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 2</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L185">runtime.ts:185</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L185">runtime.ts:185</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1689,7 +1689,7 @@
 						<div class="tsd-signature tsd-kind-icon">metal<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 8</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L190">runtime.ts:190</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L190">runtime.ts:190</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1699,7 +1699,7 @@
 						<div class="tsd-signature tsd-kind-icon">opencl<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 4</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L188">runtime.ts:188</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L188">runtime.ts:188</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1709,7 +1709,7 @@
 						<div class="tsd-signature tsd-kind-icon">vulkan<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 7</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L189">runtime.ts:189</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L189">runtime.ts:189</a></li>
 							</ul>
 						</aside>
 					</section>
@@ -1719,7 +1719,7 @@
 						<div class="tsd-signature tsd-kind-icon">webgpu<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">number</span><span class="tsd-signature-symbol"> = 15</span></div>
 						<aside class="tsd-sources">
 							<ul>
-								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/runtime.ts#L191">runtime.ts:191</a></li>
+								<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/runtime.ts#L191">runtime.ts:191</a></li>
 							</ul>
 						</aside>
 					</section>
diff --git a/docs/api/typedoc/interfaces/disposable.html b/docs/api/typedoc/interfaces/disposable.html
index 2302656..174a75a 100644
--- a/docs/api/typedoc/interfaces/disposable.html
+++ b/docs/api/typedoc/interfaces/disposable.html
@@ -113,7 +113,7 @@
 					<div class="tsd-signature tsd-kind-icon">dispose<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/types.ts#L52">types.ts:52</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/types.ts#L52">types.ts:52</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/api/typedoc/interfaces/functioninfo.html b/docs/api/typedoc/interfaces/functioninfo.html
index 5d23166..c671ecb 100644
--- a/docs/api/typedoc/interfaces/functioninfo.html
+++ b/docs/api/typedoc/interfaces/functioninfo.html
@@ -95,7 +95,7 @@
 					<div class="tsd-signature tsd-kind-icon">arg_<wbr>types<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L41">webgpu.ts:41</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -105,7 +105,7 @@
 					<div class="tsd-signature tsd-kind-icon">name<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">string</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L40">webgpu.ts:40</a></li>
 						</ul>
 					</aside>
 				</section>
@@ -115,7 +115,7 @@
 					<div class="tsd-signature tsd-kind-icon">thread_<wbr>axis_<wbr>tags<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Array</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/webgpu.ts#L42">webgpu.ts:42</a></li>
 						</ul>
 					</aside>
 				</section>
diff --git a/docs/api/typedoc/interfaces/libraryprovider.html b/docs/api/typedoc/interfaces/libraryprovider.html
index d41c7a3..6bc464c 100644
--- a/docs/api/typedoc/interfaces/libraryprovider.html
+++ b/docs/api/typedoc/interfaces/libraryprovider.html
@@ -112,7 +112,7 @@
 					<div class="tsd-signature tsd-kind-icon">imports<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-type">Record</span><span class="tsd-signature-symbol">&lt;</span><span class="tsd-signature-type">string</span><span class="tsd-signature-symbol">, </span><span class="tsd-signature-type">any</span><span class="tsd-signature-symbol">&gt;</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/types.ts#L34">types.ts:34</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/types.ts#L34">types.ts:34</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
@@ -127,7 +127,7 @@
 					<div class="tsd-signature tsd-kind-icon">start<span class="tsd-signature-symbol">:</span> <span class="tsd-signature-symbol">(</span>inst<span class="tsd-signature-symbol">: </span><span class="tsd-signature-type">Instance</span><span class="tsd-signature-symbol">)</span><span class="tsd-signature-symbol"> =&gt; </span><span class="tsd-signature-type">void</span></div>
 					<aside class="tsd-sources">
 						<ul>
-							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/380239d/web/src/types.ts#L39">types.ts:39</a></li>
+							<li>Defined in <a href="https://github.com/apache/incubator-tvm/blob/4e6fe36/web/src/types.ts#L39">types.ts:39</a></li>
 						</ul>
 					</aside>
 					<div class="tsd-comment tsd-typography">
diff --git a/docs/genindex.html b/docs/genindex.html
index 93b3fb8..5e3c1a3 100644
--- a/docs/genindex.html
+++ b/docs/genindex.html
@@ -352,6 +352,8 @@
       </ul></li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.apply_history_best">apply_history_best() (in module tvm.autotvm)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.ComputeDAG.apply_steps_from_state">apply_steps_from_state() (tvm.auto_scheduler.ComputeDAG method)</a>
+</li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyConfig">ApplyConfig (class in tvm.autotvm.task.dispatcher)</a>
 </li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyGraphBest">ApplyGraphBest (class in tvm.autotvm.task.dispatcher)</a>
@@ -482,7 +484,7 @@
 </li>
       <li><a href="api/python/tir.html#tvm.tir.AttrStmt">AttrStmt (class in tvm.tir)</a>
 </li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule.auto_schedule">auto_schedule() (in module tvm.auto_scheduler.auto_schedule)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule">auto_schedule() (in module tvm.auto_scheduler)</a>
 </li>
       <li><a href="api/python/relay/nn.html#tvm.relay.nn.avg_pool1d">avg_pool1d() (in module tvm.relay.nn)</a>
 </li>
@@ -820,6 +822,8 @@
 </li>
       <li><a href="api/python/runtime.html#tvm.runtime.TVMContext.compute_version">compute_version() (tvm.runtime.TVMContext property)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.ComputeDAG">ComputeDAG (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/te.html#tvm.te.ComputeOp">ComputeOp (class in tvm.te)</a>
 </li>
       <li><a href="api/python/relay/index.html#tvm.relay.concatenate">concatenate() (in module tvm.relay)</a>
@@ -872,10 +876,10 @@
 </li>
       <li><a href="api/python/relay/nn.html#tvm.relay.nn.contrib_conv2d_nchwc">contrib_conv2d_nchwc() (in module tvm.relay.nn)</a>
 </li>
-      <li><a href="api/python/relay/nn.html#tvm.relay.nn.contrib_conv2d_winograd_nnpack_weight_transform">contrib_conv2d_winograd_nnpack_weight_transform() (in module tvm.relay.nn)</a>
-</li>
   </ul></td>
   <td style="width: 33%; vertical-align: top;"><ul>
+      <li><a href="api/python/relay/nn.html#tvm.relay.nn.contrib_conv2d_winograd_nnpack_weight_transform">contrib_conv2d_winograd_nnpack_weight_transform() (in module tvm.relay.nn)</a>
+</li>
       <li><a href="api/python/relay/nn.html#tvm.relay.nn.contrib_conv2d_winograd_weight_transform">contrib_conv2d_winograd_weight_transform() (in module tvm.relay.nn)</a>
 </li>
       <li><a href="api/python/relay/nn.html#tvm.relay.nn.contrib_conv2d_winograd_without_weight_transform">contrib_conv2d_winograd_without_weight_transform() (in module tvm.relay.nn)</a>
@@ -1048,7 +1052,7 @@
         <li><a href="api/python/contrib.html#tvm.contrib.ndk.create_shared">(in module tvm.contrib.ndk)</a>
 </li>
       </ul></li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule.create_task">create_task() (in module tvm.auto_scheduler.auto_schedule)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.create_task">create_task() (in module tvm.auto_scheduler)</a>
 </li>
       <li><a href="api/python/contrib.html#tvm.contrib.emcc.create_tvmjs_wasm">create_tvmjs_wasm() (in module tvm.contrib.emcc)</a>
 </li>
@@ -1262,6 +1266,8 @@
 </li>
       <li><a href="api/python/ndarray.html#tvm.nd.empty">empty() (in module tvm.nd)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.EmptyPolicy">EmptyPolicy (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/runtime.html#tvm.runtime.enabled">enabled() (in module tvm.runtime)</a>
 </li>
       <li><a href="api/python/relay/testing.html#tvm.relay.testing.enabled_targets">enabled_targets() (in module tvm.relay.testing)</a>
@@ -1302,12 +1308,14 @@
 </li>
       <li><a href="api/python/relay/backend.html#tvm.relay.backend.interpreter.Executor.evaluate">evaluate() (tvm.relay.backend.interpreter.Executor method)</a>
 </li>
-      <li><a href="api/python/relay/backend.html#tvm.relay.backend.interpreter.Executor">Executor (class in tvm.relay.backend.interpreter)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.SketchPolicy.evolutionary_search">evolutionary_search() (tvm.auto_scheduler.SketchPolicy method)</a>
 </li>
-      <li><a href="api/python/runtime.html#tvm.runtime.TVMContext.exist">exist() (tvm.runtime.TVMContext property)</a>
+      <li><a href="api/python/relay/backend.html#tvm.relay.backend.interpreter.Executor">Executor (class in tvm.relay.backend.interpreter)</a>
 </li>
   </ul></td>
   <td style="width: 33%; vertical-align: top;"><ul>
+      <li><a href="api/python/runtime.html#tvm.runtime.TVMContext.exist">exist() (tvm.runtime.TVMContext property)</a>
+</li>
       <li><a href="api/python/relay/index.html#tvm.relay.exp">exp() (in module tvm.relay)</a>
 
       <ul>
@@ -1576,6 +1584,8 @@
 </li>
       <li><a href="api/python/tir.html#tvm.tir.GE">GE (class in tvm.tir)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.SketchPolicy.generate_sketches">generate_sketches() (tvm.auto_scheduler.SketchPolicy method)</a>
+</li>
       <li><a href="api/python/target.html#tvm.target.generic_func">generic_func() (in module tvm.target)</a>
 </li>
       <li><a href="api/python/target.html#tvm.target.GenericFunc">GenericFunc (class in tvm.target)</a>
@@ -1648,6 +1658,8 @@
 </li>
       <li><a href="api/python/ir.html#tvm.ir.IRModule.get_global_vars">get_global_vars() (tvm.ir.IRModule method)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.ComputeDAG.get_init_state">get_init_state() (tvm.auto_scheduler.ComputeDAG method)</a>
+</li>
       <li><a href="api/python/graph_runtime.html#tvm.contrib.graph_runtime.GraphModule.get_input">get_input() (tvm.contrib.graph_runtime.GraphModule method)</a>
 </li>
       <li><a href="api/python/ir.html#tvm.ir.Attrs.get_int">get_int() (tvm.ir.Attrs method)</a>
@@ -1850,6 +1862,8 @@
 <h2 id="H">H</h2>
 <table style="width: 100%" class="indextable genindextable"><tr>
   <td style="width: 33%; vertical-align: top;"><ul>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.HardwareParams">HardwareParams (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.has_attr">has_attr() (in module tvm.relay.dataflow_pattern)</a>
 
       <ul>
@@ -1972,6 +1986,8 @@
         <li><a href="api/python/tir.html#tvm.tir.indexmod">(in module tvm.tir)</a>
 </li>
       </ul></li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.ComputeDAG.infer_bound_from_state">infer_bound_from_state() (tvm.auto_scheduler.ComputeDAG method)</a>
+</li>
       <li><a href="api/python/tir.html#tvm.tir.transform.InferFragment">InferFragment() (in module tvm.tir.transform)</a>
 </li>
       <li><a href="api/python/relay/transform.html#tvm.relay.transform.InferType">InferType() (in module tvm.relay.transform)</a>
@@ -2178,12 +2194,16 @@
 </li>
       <li><a href="api/python/tir.html#tvm.tir.Load">Load (class in tvm.tir)</a>
 </li>
-      <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyHistoryBest.load">load() (tvm.autotvm.task.dispatcher.ApplyHistoryBest method)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel.load">load() (tvm.auto_scheduler.XGBModel method)</a>
 
       <ul>
+        <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyHistoryBest.load">(tvm.autotvm.task.dispatcher.ApplyHistoryBest method)</a>
+</li>
         <li><a href="api/python/te.html#tvm.te.hybrid.HybridModule.load">(tvm.te.hybrid.HybridModule method)</a>
 </li>
       </ul></li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.load_best">load_best() (in module tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.record.load_from_file">load_from_file() (in module tvm.autotvm.record)</a>
 </li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.tuner.GATuner.load_history">load_history() (tvm.autotvm.tuner.GATuner method)</a>
@@ -2218,15 +2238,17 @@
       </ul></li>
   </ul></td>
   <td style="width: 33%; vertical-align: top;"><ul>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.measure.LocalBuilder">LocalBuilder (class in tvm.auto_scheduler.measure)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.load_records">load_records() (in module tvm.auto_scheduler)</a>
+</li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.LocalBuilder">LocalBuilder (class in tvm.auto_scheduler)</a>
 
       <ul>
         <li><a href="api/python/autotvm.html#tvm.autotvm.measure.measure_methods.LocalBuilder">(class in tvm.autotvm.measure.measure_methods)</a>
 </li>
       </ul></li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.measure.LocalRPCMeasureContext">LocalRPCMeasureContext (class in tvm.auto_scheduler.measure)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.LocalRPCMeasureContext">LocalRPCMeasureContext (class in tvm.auto_scheduler)</a>
 </li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.measure.LocalRunner">LocalRunner (class in tvm.auto_scheduler.measure)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.LocalRunner">LocalRunner (class in tvm.auto_scheduler)</a>
 
       <ul>
         <li><a href="api/python/autotvm.html#tvm.autotvm.measure.measure_methods.LocalRunner">(class in tvm.autotvm.measure.measure_methods)</a>
@@ -2356,6 +2378,8 @@
         <li><a href="api/python/relay/dataflow_pattern.html#tvm.relay.dataflow_pattern.make_node">(in module tvm.relay.dataflow_pattern)</a>
 </li>
       </ul></li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.make_workload_key">make_workload_key() (in module tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/tir.html#tvm.tir.transform.MakePackedAPI">MakePackedAPI() (in module tvm.tir.transform)</a>
 </li>
       <li><a href="api/python/target.html#tvm.target.mali">mali() (in module tvm.target)</a>
@@ -2440,10 +2464,18 @@
 </li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.record.measure_str_key">measure_str_key() (in module tvm.autotvm.record)</a>
 </li>
-      <li><a href="api/python/autotvm.html#tvm.autotvm.measure.MeasureInput">MeasureInput (class in tvm.autotvm.measure)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.MeasureInput">MeasureInput (class in tvm.auto_scheduler)</a>
+
+      <ul>
+        <li><a href="api/python/autotvm.html#tvm.autotvm.measure.MeasureInput">(class in tvm.autotvm.measure)</a>
 </li>
-      <li><a href="api/python/autotvm.html#tvm.autotvm.measure.MeasureResult">MeasureResult (class in tvm.autotvm.measure)</a>
+      </ul></li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.MeasureResult">MeasureResult (class in tvm.auto_scheduler)</a>
+
+      <ul>
+        <li><a href="api/python/autotvm.html#tvm.autotvm.measure.MeasureResult">(class in tvm.autotvm.measure)</a>
 </li>
+      </ul></li>
       <li><a href="api/python/contrib.html#tvm.contrib.pickle_memoize.memoize">memoize() (in module tvm.contrib.pickle_memoize)</a>
 </li>
       <li><a href="api/python/relay/transform.html#tvm.relay.transform.MergeCompilerRegions">MergeCompilerRegions() (in module tvm.relay.transform)</a>
@@ -2516,10 +2548,6 @@
       <ul>
         <li><a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler">tvm.auto_scheduler</a>
 </li>
-        <li><a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule">tvm.auto_scheduler.auto_schedule</a>
-</li>
-        <li><a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler.measure">tvm.auto_scheduler.measure</a>
-</li>
         <li><a href="api/python/autotvm.html#module-tvm.autotvm">tvm.autotvm</a>
 </li>
         <li><a href="api/python/autotvm.html#module-tvm.autotvm.measure.measure">tvm.autotvm.measure.measure</a>
@@ -2913,8 +2941,6 @@
         <li><a href="api/python/tir.html#tvm.tir.popcount">(in module tvm.tir)</a>
 </li>
       </ul></li>
-  </ul></td>
-  <td style="width: 33%; vertical-align: top;"><ul>
       <li><a href="api/python/contrib.html#tvm.contrib.xcode.popen_test_rpc">popen_test_rpc() (in module tvm.contrib.xcode)</a>
 </li>
       <li><a href="api/python/rpc.html#tvm.rpc.PopenSession">PopenSession (class in tvm.rpc)</a>
@@ -2925,6 +2951,8 @@
         <li><a href="api/python/tir.html#tvm.tir.stmt_functor.post_order_visit">(in module tvm.tir.stmt_functor)</a>
 </li>
       </ul></li>
+  </ul></td>
+  <td style="width: 33%; vertical-align: top;"><ul>
       <li><a href="api/python/relay/index.html#tvm.relay.power">power() (in module tvm.relay)</a>
 
       <ul>
@@ -2937,10 +2965,20 @@
       </ul></li>
       <li><a href="api/python/te.html#tvm.te.Stage.pragma">pragma() (tvm.te.Stage method)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RandomModel.predict">predict() (tvm.auto_scheduler.RandomModel method)</a>
+
+      <ul>
+        <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel.predict">(tvm.auto_scheduler.XGBModel method)</a>
+</li>
+      </ul></li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel.predict_stages">predict_stages() (tvm.auto_scheduler.XGBModel method)</a>
+</li>
       <li><a href="api/python/tir.html#tvm.tir.Prefetch">Prefetch (class in tvm.tir)</a>
 </li>
       <li><a href="api/python/te.html#tvm.te.Stage.prefetch">prefetch() (tvm.te.Stage method)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.PreloadMeasuredStates">PreloadMeasuredStates (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/relay/nn.html#tvm.relay.nn.prelu">prelu() (in module tvm.relay.nn)</a>
 
       <ul>
@@ -2963,6 +3001,8 @@
 </li>
       <li><a href="api/python/ir.html#tvm.ir.PrimType">PrimType (class in tvm.ir)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.ComputeDAG.print_python_code_from_state">print_python_code_from_state() (tvm.auto_scheduler.ComputeDAG method)</a>
+</li>
       <li><a href="api/python/relay/analysis.html#tvm.relay.analysis.CallGraph.print_var">print_var() (tvm.relay.analysis.CallGraph method)</a>
 </li>
       <li><a href="api/python/ir.html#tvm.transform.PrintIR">PrintIR() (in module tvm.transform)</a>
@@ -3011,6 +3051,8 @@
 </li>
       <li><a href="api/python/contrib.html#tvm.contrib.random.randint">randint() (in module tvm.contrib.random)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RandomModel">RandomModel (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.tuner.RandomTuner">RandomTuner (class in tvm.autotvm.tuner)</a>
 </li>
       <li><a href="api/python/ir.html#tvm.ir.Range">Range (class in tvm.ir)</a>
@@ -3025,8 +3067,14 @@
         <li><a href="api/python/micro.html#tvm.micro.TransportLogger.read">(tvm.micro.TransportLogger method)</a>
 </li>
       </ul></li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RecordReader.read_lines">read_lines() (tvm.auto_scheduler.RecordReader method)</a>
+</li>
       <li><a href="api/python/vta/index.html#vta.reconfig_runtime">reconfig_runtime() (in module vta)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RecordReader">RecordReader (class in tvm.auto_scheduler)</a>
+</li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RecordToFile">RecordToFile (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/tir.html#tvm.tir.Reduce">Reduce (class in tvm.tir)</a>
 </li>
       <li><a href="api/python/te.html#tvm.te.reduce_axis">reduce_axis() (in module tvm.te)</a>
@@ -3061,7 +3109,7 @@
 </li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.task.topi_integration.register_topi_schedule">register_topi_schedule() (in module tvm.autotvm.task.topi_integration)</a>
 </li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.workload_registry.register_workload">register_workload() (in module tvm.auto_scheduler.workload_registry)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.register_workload">register_workload() (in module tvm.auto_scheduler)</a>
 </li>
       <li><a href="api/python/relay/index.html#tvm.relay.reinterpret">reinterpret() (in module tvm.relay)</a>
 
@@ -3109,6 +3157,8 @@
 </li>
       <li><a href="api/python/rpc.html#tvm.rpc.TrackerSession.request_and_run">request_and_run() (tvm.rpc.TrackerSession method)</a>
 </li>
+  </ul></td>
+  <td style="width: 33%; vertical-align: top;"><ul>
       <li><a href="api/python/autotvm.html#tvm.autotvm.task.topi_integration.TaskExtractEnv.reset">reset() (tvm.autotvm.task.topi_integration.TaskExtractEnv method)</a>
 
       <ul>
@@ -3123,8 +3173,6 @@
         <li><a href="api/python/autotvm.html#tvm.autotvm.tuner.XGBTuner.reset">(tvm.autotvm.tuner.XGBTuner method)</a>
 </li>
       </ul></li>
-  </ul></td>
-  <td style="width: 33%; vertical-align: top;"><ul>
       <li><a href="api/python/ir.html#tvm.ir.Op.reset_attr">reset_attr() (tvm.ir.Op method)</a>
 </li>
       <li><a href="api/python/relay/index.html#tvm.relay.reshape">reshape() (in module tvm.relay)</a>
@@ -3211,7 +3259,7 @@
       </ul></li>
       <li><a href="api/python/error.html#tvm.error.RPCError">RPCError</a>
 </li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.measure.RPCRunner">RPCRunner (class in tvm.auto_scheduler.measure)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RPCRunner">RPCRunner (class in tvm.auto_scheduler)</a>
 
       <ul>
         <li><a href="api/python/autotvm.html#tvm.autotvm.measure.measure_methods.RPCRunner">(class in tvm.autotvm.measure.measure_methods)</a>
@@ -3245,12 +3293,20 @@
         <li><a href="api/python/ndarray.html#tvm.nd.NDArray.same_as">(tvm.nd.NDArray method)</a>
 </li>
       </ul></li>
-      <li><a href="api/python/runtime.html#tvm.runtime.Module.save">save() (tvm.runtime.Module method)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.SketchPolicy.sample_initial_population">sample_initial_population() (tvm.auto_scheduler.SketchPolicy method)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel.save">save() (tvm.auto_scheduler.XGBModel method)</a>
+
+      <ul>
+        <li><a href="api/python/runtime.html#tvm.runtime.Module.save">(tvm.runtime.Module method)</a>
+</li>
+      </ul></li>
       <li><a href="api/python/ir.html#tvm.ir.save_json">save_json() (in module tvm.ir)</a>
 </li>
       <li><a href="api/python/relay/index.html#tvm.relay.save_param_dict">save_param_dict() (in module tvm.relay)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.save_records">save_records() (in module tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/relay/index.html#tvm.relay.scalar_type">scalar_type() (in module tvm.relay)</a>
 </li>
       <li><a href="api/python/topi.html#tvm.topi.nn.scale_shift_nchw">scale_shift_nchw() (in module tvm.topi.nn)</a>
@@ -3283,7 +3339,7 @@
 </li>
       <li><a href="api/python/relay/analysis.html#tvm.relay.analysis.search_fc_transpose">search_fc_transpose() (in module tvm.relay.analysis)</a>
 </li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule.SearchTask">SearchTask (class in tvm.auto_scheduler.auto_schedule)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.SearchTask">SearchTask (class in tvm.auto_scheduler)</a>
 </li>
       <li><a href="api/python/tir.html#tvm.tir.Select">Select (class in tvm.tir)</a>
 </li>
@@ -3391,12 +3447,14 @@
         <li><a href="api/python/topi.html#tvm.topi.sinh">(in module tvm.topi)</a>
 </li>
       </ul></li>
-      <li><a href="api/python/te.html#tvm.te.size_var">size_var() (in module tvm.te)</a>
-</li>
   </ul></td>
   <td style="width: 33%; vertical-align: top;"><ul>
+      <li><a href="api/python/te.html#tvm.te.size_var">size_var() (in module tvm.te)</a>
+</li>
       <li><a href="api/python/tir.html#tvm.tir.SizeVar">SizeVar (class in tvm.tir)</a>
 </li>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.SketchPolicy">SketchPolicy (class in tvm.auto_scheduler)</a>
+</li>
       <li><a href="api/python/tir.html#tvm.tir.transform.SkipAssert">SkipAssert() (in module tvm.tir.transform)</a>
 </li>
       <li><a href="api/python/relay/index.html#tvm.relay.slice_like">slice_like() (in module tvm.relay)</a>
@@ -3723,7 +3781,7 @@
       </ul></li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.tuner.Tuner">Tuner (class in tvm.autotvm.tuner)</a>
 </li>
-      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule.TuningOptions">TuningOptions (class in tvm.auto_scheduler.auto_schedule)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions">TuningOptions (class in tvm.auto_scheduler)</a>
 </li>
       <li><a href="api/python/relay/index.html#tvm.relay.Tuple">Tuple (class in tvm.relay)</a>
 </li>
@@ -3745,20 +3803,6 @@
 </li>
       </ul></li>
       <li>
-    tvm.auto_scheduler.auto_schedule
-
-      <ul>
-        <li><a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule">module</a>
-</li>
-      </ul></li>
-      <li>
-    tvm.auto_scheduler.measure
-
-      <ul>
-        <li><a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler.measure">module</a>
-</li>
-      </ul></li>
-      <li>
     tvm.autotvm
 
       <ul>
@@ -3905,8 +3949,6 @@
         <li><a href="api/python/contrib.html#module-tvm.contrib.ndk">module</a>
 </li>
       </ul></li>
-  </ul></td>
-  <td style="width: 33%; vertical-align: top;"><ul>
       <li>
     tvm.contrib.nnpack
 
@@ -3914,6 +3956,8 @@
         <li><a href="api/python/contrib.html#module-tvm.contrib.nnpack">module</a>
 </li>
       </ul></li>
+  </ul></td>
+  <td style="width: 33%; vertical-align: top;"><ul>
       <li>
     tvm.contrib.nvcc
 
@@ -4340,9 +4384,13 @@
 </li>
       <li><a href="api/python/contrib.html#tvm.contrib.tar.untar">untar() (in module tvm.contrib.tar)</a>
 </li>
-      <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyConfig.update">update() (tvm.autotvm.task.dispatcher.ApplyConfig method)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.RandomModel.update">update() (tvm.auto_scheduler.RandomModel method)</a>
 
       <ul>
+        <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel.update">(tvm.auto_scheduler.XGBModel method)</a>
+</li>
+        <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyConfig.update">(tvm.autotvm.task.dispatcher.ApplyConfig method)</a>
+</li>
         <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyGraphBest.update">(tvm.autotvm.task.dispatcher.ApplyGraphBest method)</a>
 </li>
         <li><a href="api/python/autotvm.html#tvm.autotvm.task.dispatcher.ApplyHistoryBest.update">(tvm.autotvm.task.dispatcher.ApplyHistoryBest method)</a>
@@ -4366,6 +4414,8 @@
       </ul></li>
   </ul></td>
   <td style="width: 33%; vertical-align: top;"><ul>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel.update_from_file">update_from_file() (tvm.auto_scheduler.XGBModel method)</a>
+</li>
       <li><a href="api/python/ir.html#tvm.ir.IRModule.update_func">update_func() (tvm.ir.IRModule method)</a>
 </li>
       <li><a href="api/python/rpc.html#tvm.rpc.RPCSession.upload">upload() (tvm.rpc.RPCSession method)</a>
@@ -4513,9 +4563,11 @@
   <td style="width: 33%; vertical-align: top;"><ul>
       <li><a href="api/python/contrib.html#tvm.contrib.xcode.XCodeRPCServer">XCodeRPCServer (class in tvm.contrib.xcode)</a>
 </li>
+      <li><a href="api/python/contrib.html#tvm.contrib.xcode.xcrun">xcrun() (in module tvm.contrib.xcode)</a>
+</li>
   </ul></td>
   <td style="width: 33%; vertical-align: top;"><ul>
-      <li><a href="api/python/contrib.html#tvm.contrib.xcode.xcrun">xcrun() (in module tvm.contrib.xcode)</a>
+      <li><a href="api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel">XGBModel (class in tvm.auto_scheduler)</a>
 </li>
       <li><a href="api/python/autotvm.html#tvm.autotvm.tuner.XGBTuner">XGBTuner (class in tvm.autotvm.tuner)</a>
 </li>
diff --git a/docs/objects.inv b/docs/objects.inv
index dbf0ac4..d52b148 100644
Binary files a/docs/objects.inv and b/docs/objects.inv differ
diff --git a/docs/py-modindex.html b/docs/py-modindex.html
index e840082..d1a70eb 100644
--- a/docs/py-modindex.html
+++ b/docs/py-modindex.html
@@ -215,16 +215,6 @@
      <tr class="cg-1">
        <td></td>
        <td>&#160;&#160;&#160;
-       <a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule"><code class="xref">tvm.auto_scheduler.auto_schedule</code></a></td><td>
-       <em></em></td></tr>
-     <tr class="cg-1">
-       <td></td>
-       <td>&#160;&#160;&#160;
-       <a href="api/python/auto_scheduler.html#module-tvm.auto_scheduler.measure"><code class="xref">tvm.auto_scheduler.measure</code></a></td><td>
-       <em></em></td></tr>
-     <tr class="cg-1">
-       <td></td>
-       <td>&#160;&#160;&#160;
        <a href="api/python/autotvm.html#module-tvm.autotvm"><code class="xref">tvm.autotvm</code></a></td><td>
        <em></em></td></tr>
      <tr class="cg-1">
diff --git a/docs/searchindex.js b/docs/searchindex.js
index b5e2af3..1577389 100644
--- a/docs/searchindex.js
+++ b/docs/searchindex.js
@@ -1 +1 @@
-Search.setIndex({docnames:["api/links","api/python/auto_scheduler","api/python/autotvm","api/python/contrib","api/python/driver","api/python/error","api/python/graph_runtime","api/python/index","api/python/ir","api/python/micro","api/python/ndarray","api/python/relay/analysis","api/python/relay/backend","api/python/relay/dataflow_pattern","api/python/relay/frontend","api/python/relay/image","api/python/relay/index","api/python/relay/nn","api/python/relay/testing","api/python/relay/transf [...]
\ No newline at end of file
+Search.setIndex({docnames:["api/links","api/python/auto_scheduler","api/python/autotvm","api/python/contrib","api/python/driver","api/python/error","api/python/graph_runtime","api/python/index","api/python/ir","api/python/micro","api/python/ndarray","api/python/relay/analysis","api/python/relay/backend","api/python/relay/dataflow_pattern","api/python/relay/frontend","api/python/relay/image","api/python/relay/index","api/python/relay/nn","api/python/relay/testing","api/python/relay/transf [...]
\ No newline at end of file
diff --git a/docs/tutorials/auto_scheduler/sg_execution_times.html b/docs/tutorials/auto_scheduler/sg_execution_times.html
index 89af54b..7d7d125 100644
--- a/docs/tutorials/auto_scheduler/sg_execution_times.html
+++ b/docs/tutorials/auto_scheduler/sg_execution_times.html
@@ -192,10 +192,10 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-auto-scheduler-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>04:47.459</strong> total execution time for <strong>tutorials_auto_scheduler</strong> files:</p>
+<p><strong>04:38.356</strong> total execution time for <strong>tutorials_auto_scheduler</strong> files:</p>
 <ul class="simple">
-<li><p><strong>02:51.342</strong>: <a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a convolution layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></li>
-<li><p><strong>01:56.117</strong>: <a class="reference internal" href="tune_matmul_x86.html#sphx-glr-tutorials-auto-scheduler-tune-matmul-x86-py"><span class="std std-ref">Auto-scheduling matrix multiplication for CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_matmul_x86.py</span></code>)</p></li>
+<li><p><strong>02:51.937</strong>: <a class="reference internal" href="tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py"><span class="std std-ref">Auto-scheduling a convolution layer for GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_layer_cuda.py</span></code>)</p></li>
+<li><p><strong>01:46.419</strong>: <a class="reference internal" href="tune_matmul_x86.html#sphx-glr-tutorials-auto-scheduler-tune-matmul-x86-py"><span class="std std-ref">Auto-scheduling matrix multiplication for CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_matmul_x86.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html b/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html
index 23f7b14..3674730 100644
--- a/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html
+++ b/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html
@@ -248,7 +248,7 @@ From these tensors, the auto-scheduler can get the whole computational graph.</p
 
 <span class="c1"># the last layer in resnet</span>
 <span class="n">N</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">CO</span><span class="p">,</span> <span class="n">CI</span><span class="p">,</span> <span class="n">KH</span><span class="p">,</span> <span class="n">KW</span><span class="p">,</span> <span class="n">strides</span><span class="p">,</span> <span class="n">padding</span> <span class="o">=</span> <span class="mi">1</span><span cla [...]
-<span class="n">task</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">conv2d_layer</span><span class="p">,</span> <span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">CO</span><span class="p">,</span> <span class="n">CI</span><span class="p [...]
+<span class="n">task</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.create_task" title="View documentation for tvm.auto_scheduler.create_task"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">create_task</span></a><span class="p">(</span><span class="n">conv2d_layer</span><span class="p">,</span> <span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">H</span><span class="p">,</ [...]
 
 <span class="c1"># Inspect the computational graph</span>
 <span class="nb">print</span><span class="p">(</span><span class="n">task</span><span class="o">.</span><span class="n">compute_dag</span><span class="p">)</span>
@@ -267,26 +267,26 @@ compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
 <p>Next, we set parameters for the auto-scheduler. These parameters
 mainly specify how we do the measurement during the search and auto-tuning.</p>
 <ul class="simple">
-<li><p><cite>measure_ctx</cite> launches a different process for measurement. This
+<li><p><code class="code docutils literal notranslate"><span class="pre">measure_ctx</span></code> launches a different process for measurement. This
 provides isolation. It can protect the master process from GPU crashes
 that happen during measurement and avoid other runtime conflicts.</p></li>
-<li><p><cite>min_repeat_ms</cite> defines the minimum duration of one “repeat” in every measurement.
+<li><p><code class="code docutils literal notranslate"><span class="pre">min_repeat_ms</span></code> defines the minimum duration of one “repeat” in every measurement.
 This can warm up the GPU, which is necessary to get accurate measurement results.
 Typically, we recommend a value &gt; 300 ms.</p></li>
-<li><p><cite>num_measure_trials</cite> is the number of measurement trials we can use during the search.
+<li><p><code class="code docutils literal notranslate"><span class="pre">num_measure_trials</span></code> is the number of measurement trials we can use during the search.
 We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
 good value for the search to converge. You can do more trials according to your time budget.</p></li>
-<li><p>In addition, we use <cite>RecordToFile</cite> to dump measurement records into a file <cite>conv2d.json</cite>.
+<li><p>In addition, we use <code class="code docutils literal notranslate"><span class="pre">RecordToFile</span></code> to dump measurement records into a file <cite>conv2d.json</cite>.
 The measurement records can be used to query the history best, resume the search,
 and do more analyses later.</p></li>
-<li><p>see <a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule.TuningOptions" title="tvm.auto_scheduler.auto_schedule.TuningOptions"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.auto_schedule.TuningOptions</span></code></a>:,
-<a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.measure.LocalRPCMeasureContext" title="tvm.auto_scheduler.measure.LocalRPCMeasureContext"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.measure.LocalRPCMeasureContext</span></code></a> for more parameters.</p></li>
+<li><p>see <a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions" title="tvm.auto_scheduler.TuningOptions"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.TuningOptions</span></code></a>,
+<a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.LocalRPCMeasureContext" title="tvm.auto_scheduler.LocalRPCMeasureContext"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.LocalRPCMeasureContext</span></code></a> for more parameters.</p></li>
 </ul>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">measure_ctx</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">LocalRPCMeasureContext</span><span class="p">(</span><span class="n">min_repeat_ms</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
-<span class="n">tune_option</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span><span class="p">(</span>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">measure_ctx</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.LocalRPCMeasureContext" title="View documentation for tvm.auto_scheduler.LocalRPCMeasureContext"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">LocalRPCMeasureContext</span></a><span class="p">(</span><span class="n">min_repeat_ms</span><span class="o">=</span><span [...]
+<span class="n">tune_option</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions" title="View documentation for tvm.auto_scheduler.TuningOptions"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span></a><span class="p">(</span>
     <span class="n">num_measure_trials</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
     <span class="n">runner</span><span class="o">=</span><span class="n">measure_ctx</span><span class="o">.</span><span class="n">runner</span><span class="p">,</span>
-    <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span><span class="p">(</span><span class="s2">&quot;conv2d.json&quot;</span><span class="p">)],</span>
+    <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.RecordToFile" title="View documentation for tvm.auto_scheduler.RecordToFile"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span></a><span class="p">(</span><span class="s2">&quot;conv2d.json&quot;</span><span class="p">)],</span>
 <span class="p">)</span>
 </pre></div>
 </div>
@@ -300,7 +300,7 @@ and do more analyses later.</p></li>
 <p>Now we get all inputs ready. Pretty simple, isn’t it?
 We can kick off the search and let the auto-scheduler do its magic.
 After some measurement trials, it will return the best schedule it found.</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p [...]
+<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p">,</sp [...]
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
@@ -316,1110 +316,319 @@ cooperative fetching, unrolling and operator fusion.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
-             compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
+  buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
+             bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
              kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),
              data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
   buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
-  attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 64;
+  attr [IterVar(blockIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.x&quot;)] &quot;thread_extent&quot; = 16;
   attr [compute_3: Pointer(float32)] &quot;storage_scope&quot; = &quot;local&quot;;
-  allocate(compute_3, float32, [8]);
+  allocate(compute_3, float32, [14]);
   attr [pad_temp.shared: Pointer(float32)] &quot;storage_scope&quot; = &quot;shared&quot;;
-  allocate(pad_temp.shared, float32, [1568]);
+  allocate(pad_temp.shared, float32, [162]);
   attr [kernel.shared: Pointer(float32)] &quot;storage_scope&quot; = &quot;shared&quot;;
-  allocate(kernel.shared, float32, [256]);
-  attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
+  allocate(kernel.shared, float32, [576]);
+  attr [IterVar(threadIdx.x: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112 {
     compute_3[0] = 0f32
-    compute_3[2] = 0f32
-    compute_3[4] = 0f32
-    compute_3[6] = 0f32
     compute_3[1] = 0f32
+    compute_3[2] = 0f32
     compute_3[3] = 0f32
+    compute_3[4] = 0f32
     compute_3[5] = 0f32
+    compute_3[6] = 0f32
     compute_3[7] = 0f32
-    for (rc.outer.outer: int32, 0, 16) {
-      for (ry.outer.outer: int32, 0, 3) {
-        attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[(threadIdx.x_1*2)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) - 8)], 0f32, dtype=float32)
-          pad_temp.shared[((threadIdx.x_1*2) + 1)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) - 8)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 98)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 90)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 98)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 90)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 196)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 188)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 196)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 188)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 294)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 286)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 294)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 286)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 392)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 384)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 392)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 384)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 490)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 482)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 490)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 482)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 588)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 580)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 588)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 580)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 686)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 678)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 686)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 678)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 784)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 776)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 784)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 776)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 882)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 874)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 882)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 874)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 980)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 972)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 980)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 972)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1078)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1070)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1078)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1070)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1176)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1168)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1176)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1168)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1274)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1266)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1274)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1266)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1372)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1364)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1372)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1364)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1470)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*2), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1462)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1470)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*2) + 1), 7))), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1462)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[(threadIdx.x_2*12)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2*12), 32)*4608)) + (rc.outer.outer*288)) + (floormod((threadIdx.x_2*12), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 1)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 1), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 1), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 2)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 2), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 2), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 3)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 3), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 3), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 4)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 4), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 4), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 5)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 5), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 5), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 6)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 6), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 6), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 7)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 7), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 7), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 8)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 8), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 8), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 9)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 9), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 9), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 10)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 10), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 10), 32)*9)) + (ry.outer.outer*3))]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 11)] = (float32*)kernel_2[(((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 11), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 11), 32)*9)) + (ry.outer.outer*3))]
-          }
-        }
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[0]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[64]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[128]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[192]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[1]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[65]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[129]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[193]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[2]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[66]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[130]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[194]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[3]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[67]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[131]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[195]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[4]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[68]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[132]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[196]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[5]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[69]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[133]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[197]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[6]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[70]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[134]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[198]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[7]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[71]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[135]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[199]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[8]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[72]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[136]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[200]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[9]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[73]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[137]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[201]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[10]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[74]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[138]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[202]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[11]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[75]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[139]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[203]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[12]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[76]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[140]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[204]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[13]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[77]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[141]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[205]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[14]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[78]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[142]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[206]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[15]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[79]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[143]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[207]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[32]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[96]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[160]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[224]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[33]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[97]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[161]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[225]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[34]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[98]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[162]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[226]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[35]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[99]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[163]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[227]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[36]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[100]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[164]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[228]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[37]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[101]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[165]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[229]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[38]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[102]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[166]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[230]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[39]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[103]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[167]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[231]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[40]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[104]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[168]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[232]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[41]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[105]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[169]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[233]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[42]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[106]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[170]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[234]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[43]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[107]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[171]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[235]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[44]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[108]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[172]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[236]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[45]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[109]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[173]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[237]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[46]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[110]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[174]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[238]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[47]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[111]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[175]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[239]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[16]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[80]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[144]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[208]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[17]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[81]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[145]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[209]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[18]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[82]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[146]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[210]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[19]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[83]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[147]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[211]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[20]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[84]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[148]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[212]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[21]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[85]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[149]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[213]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[22]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[86]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[150]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[214]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[23]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[87]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[151]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[215]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[24]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[88]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[152]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[216]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[25]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[89]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[153]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[217]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[26]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[90]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[154]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[218]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[27]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[91]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[155]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[219]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[28]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[92]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[156]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[220]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[29]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[93]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[157]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[221]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[30]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[94]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[158]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[222]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[31]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[95]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[159]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[223]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[48]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[112]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[176]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[240]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[49]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[113]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[177]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[241]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[50]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[114]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[178]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[242]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[51]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[115]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[179]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[243]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[52]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[116]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[180]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[244]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[53]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[117]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[181]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[245]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[54]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[118]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[182]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[246]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[55]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[119]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[183]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[247]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[56]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[120]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[184]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[248]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[57]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[121]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[185]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[249]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[58]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[122]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[186]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[250]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[59]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[123]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[187]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[251]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[60]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[124]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[188]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[252]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[61]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[125]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[189]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[253]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[62]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[126]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[190]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[254]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[63]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[127]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[191]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[255]))
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[(threadIdx.x_1*2)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) - 7)], 0f32, dtype=float32)
-          pad_temp.shared[((threadIdx.x_1*2) + 1)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) - 7)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 98)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 91)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 98)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 91)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 196)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 189)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 196)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 189)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 294)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 287)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 294)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 287)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 392)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 385)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 392)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 385)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 490)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 483)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 490)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 483)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 588)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 581)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 588)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 581)], 0f32, dtype=float32)
+    compute_3[8] = 0f32
+    compute_3[9] = 0f32
+    compute_3[10] = 0f32
+    compute_3[11] = 0f32
+    compute_3[12] = 0f32
+    compute_3[13] = 0f32
+    for (rc.outer.outer: int32, 0, 256) {
+      attr [IterVar(threadIdx.x_1: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112 {
+        if @tir.likely((threadIdx.x_1 &lt; 41), dtype=bool) {
+          pad_temp.shared[(threadIdx.x_1*4)] = @tir.if_then_else(((((9 &lt;= floormod((threadIdx.x_1*4), 81)) &amp;&amp; (floormod((threadIdx.x_1*4), 81) &lt; 72)) &amp;&amp; (1 &lt;= floormod((threadIdx.x_1*4), 9))) &amp;&amp; (floormod((threadIdx.x_1*4), 9) &lt; 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv((threadIdx.x_1*4), 81)*49)) + (floordiv(floormod((threadIdx.x_1*4), 81), 9)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
+        }
+        if @tir.likely((threadIdx.x_1 &lt; 41), dtype=bool) {
+          pad_temp.shared[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((9 &lt;= floormod(((threadIdx.x_1*4) + 1), 81)) &amp;&amp; (floormod(((threadIdx.x_1*4) + 1), 81) &lt; 72)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*4) + 1), 9))) &amp;&amp; (floormod(((threadIdx.x_1*4) + 1), 9) &lt; 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv(((threadIdx.x_1*4) + 1), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 1), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)],  [...]
+        }
+        if @tir.likely((threadIdx.x_1 &lt; 40), dtype=bool) {
+          pad_temp.shared[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((9 &lt;= floormod(((threadIdx.x_1*4) + 2), 81)) &amp;&amp; (floormod(((threadIdx.x_1*4) + 2), 81) &lt; 72)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*4) + 2), 9))) &amp;&amp; (floormod(((threadIdx.x_1*4) + 2), 9) &lt; 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv(((threadIdx.x_1*4) + 2), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 2), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)],  [...]
+        }
+        if @tir.likely((threadIdx.x_1 &lt; 40), dtype=bool) {
+          pad_temp.shared[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((9 &lt;= floormod(((threadIdx.x_1*4) + 3), 81)) &amp;&amp; (floormod(((threadIdx.x_1*4) + 3), 81) &lt; 72)) &amp;&amp; (1 &lt;= floormod(((threadIdx.x_1*4) + 3), 9))) &amp;&amp; (floormod(((threadIdx.x_1*4) + 3), 9) &lt; 8)), (float32*)data_2[(((((rc.outer.outer*98) + (floordiv(((threadIdx.x_1*4) + 3), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 3), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)],  [...]
         }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 686)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 679)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 686)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 679)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 784)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 777)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 784)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 777)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 882)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 875)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 882)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 875)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 980)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 973)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 980)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 973)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1078)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1071)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1078)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1071)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1176)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1169)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1176)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1169)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1274)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1267)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1274)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1267)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1372)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1365)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1372)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1365)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1470)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1463)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1470)] = @tir.if_then_else(((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1463)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[(threadIdx.x_2*12)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2*12), 32)*4608)) + (rc.outer.outer*288)) + (floormod((threadIdx.x_2*12), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 1)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 1), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 1), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 2)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 2), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 2), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 3)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 3), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 3), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 4)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 4), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 4), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 5)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 5), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 5), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 6)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 6), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 6), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 7)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 7), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 7), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 8)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 8), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 8), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 9)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 9), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 9), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 10)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 10), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 10), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 11)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 11), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 11), 32)*9)) + (ry.outer.outer*3)) + 1)]
-          }
-        }
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[0]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[64]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[128]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[192]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[1]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[65]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[129]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[193]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[2]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[66]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[130]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[194]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[3]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[67]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[131]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[195]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[4]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[68]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[132]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[196]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[5]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[69]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[133]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[197]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[6]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[70]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[134]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[198]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[7]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[71]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[135]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[199]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[8]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[72]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[136]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[200]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[9]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[73]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[137]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[201]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[10]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[74]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[138]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[202]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[11]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[75]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[139]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[203]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[12]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[76]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[140]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[204]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[13]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[77]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[141]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[205]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[14]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[78]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[142]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[206]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[15]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[79]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[143]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[207]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[32]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[96]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[160]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[224]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[33]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[97]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[161]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[225]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[34]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[98]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[162]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[226]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[35]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[99]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[163]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[227]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[36]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[100]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[164]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[228]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[37]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[101]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[165]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[229]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[38]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[102]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[166]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[230]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[39]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[103]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[167]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[231]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[40]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[104]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[168]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[232]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[41]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[105]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[169]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[233]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[42]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[106]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[170]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[234]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[43]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[107]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[171]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[235]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[44]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[108]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[172]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[236]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[45]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[109]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[173]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[237]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[46]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[110]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[174]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[238]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[47]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[111]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[175]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[239]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[16]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[80]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[144]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[208]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[17]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[81]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[145]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[209]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[18]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[82]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[146]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[210]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[19]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[83]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[147]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[211]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[20]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[84]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[148]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[212]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[21]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[85]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[149]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[213]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[22]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[86]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[150]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[214]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[23]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[87]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[151]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[215]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[24]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[88]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[152]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[216]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[25]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[89]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[153]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[217]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[26]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[90]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[154]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[218]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[27]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[91]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[155]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[219]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[28]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[92]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[156]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[220]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[29]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[93]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[157]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[221]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[30]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[94]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[158]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[222]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[31]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[95]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[159]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[223]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[48]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[112]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[176]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[240]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[49]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[113]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[177]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[241]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[50]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[114]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[178]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[242]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[51]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[115]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[179]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[243]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[52]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[116]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[180]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[244]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[53]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[117]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[181]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[245]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[54]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[118]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[182]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[246]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[55]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[119]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[183]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[247]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[56]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[120]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[184]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[248]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[57]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[121]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[185]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[249]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[58]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[122]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[186]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[250]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[59]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[123]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[187]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[251]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[60]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[124]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[188]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[252]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[61]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[125]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[189]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[253]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[62]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[126]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[190]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[254]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[63]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[127]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[191]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[255]))
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[(threadIdx.x_1*2)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) - 6)], 0f32, dtype=float32)
-          pad_temp.shared[((threadIdx.x_1*2) + 1)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) - 6)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 98)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 92)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 98)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 92)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 196)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 190)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 196)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 190)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 294)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 288)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 294)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 288)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 392)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 386)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 392)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 386)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 490)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 484)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 490)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 484)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 588)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 582)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 588)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 582)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 686)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 680)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 686)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 680)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 784)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 778)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 784)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 778)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 882)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 876)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 882)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 876)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 980)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 974)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 980)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 974)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1078)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1072)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1078)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1072)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1176)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1170)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1176)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1170)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1274)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1268)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1274)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1268)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1372)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1366)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1372)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1366)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_1, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          pad_temp.shared[((threadIdx.x_1*2) + 1470)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod((threadIdx.x_1*2), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod((threadIdx.x_1*2), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + (threadIdx.x_1*2)) + 1464)], 0f32, dtype=float32)
-          pad_temp.shared[(((threadIdx.x_1*2) + 1) + 1470)] = @tir.if_then_else((((1 &lt;= (floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer)) &amp;&amp; ((floordiv(floormod(((threadIdx.x_1*2) + 1), 49), 7) + ry.outer.outer) &lt; 8)) &amp;&amp; (floormod(((threadIdx.x_1*2) + 1), 7) &lt; 6)), (float32*)data_2[((((rc.outer.outer*1568) + (ry.outer.outer*7)) + ((threadIdx.x_1*2) + 1)) + 1464)], 0f32, dtype=float32)
-        }
-        attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 49 {
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[(threadIdx.x_2*12)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2*12), 32)*4608)) + (rc.outer.outer*288)) + (floormod((threadIdx.x_2*12), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 1)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 1), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 1), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 2)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 2), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 2), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 22), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 3)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 3), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 3), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 4)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 4), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 4), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 5)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 5), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 5), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 6)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 6), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 6), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 7)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 7), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 7), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 8)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 8), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 8), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 9)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 9), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 9), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 10)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 10), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 10), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-          if @tir.likely((threadIdx.x_2 &lt; 21), dtype=bool) {
-            kernel.shared[((threadIdx.x_2*12) + 11)] = (float32*)kernel_2[((((((blockIdx.x*36864) + (floordiv(((threadIdx.x_2*12) + 11), 32)*4608)) + (rc.outer.outer*288)) + (floormod(((threadIdx.x_2*12) + 11), 32)*9)) + (ry.outer.outer*3)) + 2)]
-          }
-        }
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[0]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[64]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[128]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[192]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[1]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[65]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[129]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[193]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[2]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[66]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[130]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[194]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[3]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[67]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[131]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[195]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[4]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[68]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[132]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[196]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[5]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[69]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[133]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[197]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[6]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[70]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[134]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[198]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[7]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[71]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[135]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[199]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[8]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[72]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[136]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[200]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[9]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[73]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[137]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[201]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[10]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[74]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[138]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[202]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[11]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[75]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[139]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[203]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[12]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[76]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[140]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[204]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[13]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[77]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[141]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[205]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[14]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[78]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[142]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[206]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[15]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[79]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[143]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[207]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[32]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[96]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[160]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[threadIdx.x]*(float32*)kernel.shared[224]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[33]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[97]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[161]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 49)]*(float32*)kernel.shared[225]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[34]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[98]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[162]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 98)]*(float32*)kernel.shared[226]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[35]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[99]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[163]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 147)]*(float32*)kernel.shared[227]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[36]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[100]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[164]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 196)]*(float32*)kernel.shared[228]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[37]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[101]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[165]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 245)]*(float32*)kernel.shared[229]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[38]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[102]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[166]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 294)]*(float32*)kernel.shared[230]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[39]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[103]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[167]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 343)]*(float32*)kernel.shared[231]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[40]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[104]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[168]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 392)]*(float32*)kernel.shared[232]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[41]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[105]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[169]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 441)]*(float32*)kernel.shared[233]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[42]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[106]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[170]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 490)]*(float32*)kernel.shared[234]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[43]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[107]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[171]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 539)]*(float32*)kernel.shared[235]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[44]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[108]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[172]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 588)]*(float32*)kernel.shared[236]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[45]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[109]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[173]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 637)]*(float32*)kernel.shared[237]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[46]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[110]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[174]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 686)]*(float32*)kernel.shared[238]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[47]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[111]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[175]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 735)]*(float32*)kernel.shared[239]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[16]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[80]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[144]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[208]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[17]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[81]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[145]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[209]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[18]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[82]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[146]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[210]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[19]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[83]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[147]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[211]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[20]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[84]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[148]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[212]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[21]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[85]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[149]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[213]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[22]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[86]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[150]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[214]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[23]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[87]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[151]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[215]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[24]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[88]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[152]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[216]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[25]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[89]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[153]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[217]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[26]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[90]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[154]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[218]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[27]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[91]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[155]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[219]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[28]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[92]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[156]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[220]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[29]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[93]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[157]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[221]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[30]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[94]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[158]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[222]))
-        compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[31]))
-        compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[95]))
-        compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[159]))
-        compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[223]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[48]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[112]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[176]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 784)]*(float32*)kernel.shared[240]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[49]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[113]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[177]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 833)]*(float32*)kernel.shared[241]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[50]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[114]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[178]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 882)]*(float32*)kernel.shared[242]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[51]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[115]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[179]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 931)]*(float32*)kernel.shared[243]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[52]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[116]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[180]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 980)]*(float32*)kernel.shared[244]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[53]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[117]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[181]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1029)]*(float32*)kernel.shared[245]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[54]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[118]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[182]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1078)]*(float32*)kernel.shared[246]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[55]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[119]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[183]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1127)]*(float32*)kernel.shared[247]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[56]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[120]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[184]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1176)]*(float32*)kernel.shared[248]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[57]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[121]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[185]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1225)]*(float32*)kernel.shared[249]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[58]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[122]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[186]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1274)]*(float32*)kernel.shared[250]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[59]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[123]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[187]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1323)]*(float32*)kernel.shared[251]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[60]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[124]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[188]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1372)]*(float32*)kernel.shared[252]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[61]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[125]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[189]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1421)]*(float32*)kernel.shared[253]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[62]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[126]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[190]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1470)]*(float32*)kernel.shared[254]))
-        compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[63]))
-        compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[127]))
-        compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[191]))
-        compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(threadIdx.x + 1519)]*(float32*)kernel.shared[255]))
       }
+      attr [IterVar(threadIdx.x_2: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
+      kernel.shared[threadIdx.x_2] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv(threadIdx.x_2, 18)*4608)) + (rc.outer.outer*18)) + floormod(threadIdx.x_2, 18))]
+      attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
+      kernel.shared[(threadIdx.x_2 + 112)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 112), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 4), 18))]
+      attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
+      kernel.shared[(threadIdx.x_2 + 224)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 224), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 8), 18))]
+      attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
+      kernel.shared[(threadIdx.x_2 + 336)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 336), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 12), 18))]
+      attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
+      kernel.shared[(threadIdx.x_2 + 448)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 448), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 16), 18))]
+      attr [IterVar(threadIdx.x_2, (nullptr), &quot;ThreadIndex&quot;, &quot;threadIdx.x&quot;)] &quot;thread_extent&quot; = 112;
+      if @tir.likely((threadIdx.x_2 &lt; 16), dtype=bool) {
+        kernel.shared[(threadIdx.x_2 + 560)] = (float32*)kernel_2[((((blockIdx.x*147456) + (floordiv((threadIdx.x_2 + 560), 18)*4608)) + (rc.outer.outer*18)) + floormod((threadIdx.x_2 + 2), 18))]
+      }
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7)*9)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*36)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7)*9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 18)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 1)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 19)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 8)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 2)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 8)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 20)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 3)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 21)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 4)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 22)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 17)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 5)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 17)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 23)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 6)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 24)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 7)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 25)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 26)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 8)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 26)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 26)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 81)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 9)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 81)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 27)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 10)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 28)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 89)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 11)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 89)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 29)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 90)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 12)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 90)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 30)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 13)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 31)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 98)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 14)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 98)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 32)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 99)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 15)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 99)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 33)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 16)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 34)]))
+      compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 107)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 17)]))
+      compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+      compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+      compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+      compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+      compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+      compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
+      compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 107)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*36) + 35)]))
     }
     for (i1.inner: int32, 0, 2) {
-      compute_2[(((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x)] = max(((float32*)compute_3[i1.inner] + (float32*)bias_2[((blockIdx.x*8) + i1.inner)]), 0f32)
-      compute_2[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 98)] = max(((float32*)compute_3[(i1.inner + 2)] + (float32*)bias_2[(((blockIdx.x*8) + i1.inner) + 2)]), 0f32)
-      compute_2[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 196)] = max(((float32*)compute_3[(i1.inner + 4)] + (float32*)bias_2[(((blockIdx.x*8) + i1.inner) + 4)]), 0f32)
-      compute_2[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 294)] = max(((float32*)compute_3[(i1.inner + 6)] + (float32*)bias_2[(((blockIdx.x*8) + i1.inner) + 6)]), 0f32)
+      for (i3.inner: int32, 0, 7) {
+        compute_2[(((((blockIdx.x*1568) + (floordiv(threadIdx.x, 7)*98)) + (i1.inner*49)) + (floormod(threadIdx.x, 7)*7)) + i3.inner)] = max(((float32*)compute_3[((i1.inner*7) + i3.inner)] + (float32*)bias_2[(((blockIdx.x*32) + (floordiv(threadIdx.x, 7)*2)) + i1.inner)]), 0f32)
+      }
     }
   }
 }
@@ -1457,7 +666,7 @@ cooperative fetching, unrolling and operator fusion.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Execution time of this operator: 0.322 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Execution time of this operator: 0.356 ms
 </pre></div>
 </div>
 </div>
@@ -1469,7 +678,7 @@ resume the search, and perform other analyses.</p>
 <p>Here is an example where we load the best schedule from a file,
 print the equivalent python schedule API, and build the binary again.</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span class="c1"># Load the measurement record for the best schedule</span>
-<span class="n">inp</span><span class="p">,</span> <span class="n">res</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">load_best</span><span class="p">(</span><span class="s2">&quot;conv2d.json&quot;</span><span class="p">,</span> <span class="n">task</span><span class="o">.</span><span class="n">workload_key</span><span class="p">)</span>
+<span class="n">inp</span><span class="p">,</span> <span class="n">res</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.load_best" title="View documentation for tvm.auto_scheduler.load_best"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">load_best</span></a><span class="p">(</span><span class="s2">&quot;conv2d.json&quot;</span><span class="p">,</span> <span class="n">task</span><span class="o">.</span><span cla [...]
 
 <span class="c1"># Print equivalent python schedule API. This can be used for debugging and</span>
 <span class="c1"># learning the behavior of the auto-scheduler.</span>
@@ -1495,34 +704,34 @@ nn_o_o_o_i, nn_o_o_i = s[compute].split(nn_o_o_i, factor=1)
 nn_o_o_o_o, nn_o_o_o_i = s[compute].split(nn_o_o_o_i, factor=1)
 ff_o_i, ff_i = s[compute].split(ff, factor=1)
 ff_o_o_i, ff_o_i = s[compute].split(ff_o_i, factor=1)
-ff_o_o_o_i, ff_o_o_i = s[compute].split(ff_o_o_i, factor=32)
+ff_o_o_o_i, ff_o_o_i = s[compute].split(ff_o_o_i, factor=16)
 ff_o_o_o_o, ff_o_o_o_i = s[compute].split(ff_o_o_o_i, factor=1)
-yy_o_i, yy_i = s[compute].split(yy, factor=7)
-yy_o_o_i, yy_o_i = s[compute].split(yy_o_i, factor=1)
+yy_o_i, yy_i = s[compute].split(yy, factor=1)
+yy_o_o_i, yy_o_i = s[compute].split(yy_o_i, factor=7)
 yy_o_o_o_i, yy_o_o_i = s[compute].split(yy_o_o_i, factor=1)
 yy_o_o_o_o, yy_o_o_o_i = s[compute].split(yy_o_o_o_i, factor=1)
 xx_o_i, xx_i = s[compute].split(xx, factor=1)
 xx_o_o_i, xx_o_i = s[compute].split(xx_o_i, factor=1)
-xx_o_o_o_i, xx_o_o_i = s[compute].split(xx_o_o_i, factor=1)
+xx_o_o_o_i, xx_o_o_i = s[compute].split(xx_o_o_i, factor=7)
 xx_o_o_o_o, xx_o_o_o_i = s[compute].split(xx_o_o_o_i, factor=1)
-rc_o_i, rc_i = s[compute].split(rc, factor=16)
-rc_o_o, rc_o_i = s[compute].split(rc_o_i, factor=1)
+rc_o_i, rc_i = s[compute].split(rc, factor=8)
+rc_o_o, rc_o_i = s[compute].split(rc_o_i, factor=2)
 ry_o_i, ry_i = s[compute].split(ry, factor=3)
 ry_o_o, ry_o_i = s[compute].split(ry_o_i, factor=1)
-rx_o_i, rx_i = s[compute].split(rx, factor=3)
-rx_o_o, rx_o_i = s[compute].split(rx_o_i, factor=1)
+rx_o_i, rx_i = s[compute].split(rx, factor=1)
+rx_o_o, rx_o_i = s[compute].split(rx_o_i, factor=3)
 s[compute].reorder(nn_o_o_o_o, ff_o_o_o_o, yy_o_o_o_o, xx_o_o_o_o, nn_o_o_o_i, ff_o_o_o_i, yy_o_o_o_i, xx_o_o_o_i, nn_o_o_i, ff_o_o_i, yy_o_o_i, xx_o_o_i, rc_o_o, ry_o_o, rx_o_o, rc_o_i, ry_o_i, rx_o_i, nn_o_i, ff_o_i, yy_o_i, xx_o_i, rc_i, ry_i, rx_i, nn_i, ff_i, yy_i, xx_i)
 i0_o_i, i0_i = s[compute].split(i0, factor=1)
 i0_o_o_i, i0_o_i = s[compute].split(i0_o_i, factor=1)
 i0_o_o_o, i0_o_o_i = s[compute].split(i0_o_o_i, factor=1)
 i1_o_i, i1_i = s[compute].split(i1, factor=1)
-i1_o_o_i, i1_o_i = s[compute].split(i1_o_i, factor=32)
+i1_o_o_i, i1_o_i = s[compute].split(i1_o_i, factor=16)
 i1_o_o_o, i1_o_o_i = s[compute].split(i1_o_o_i, factor=1)
 i2_o_i, i2_i = s[compute].split(i2, factor=7)
 i2_o_o_i, i2_o_i = s[compute].split(i2_o_i, factor=1)
 i2_o_o_o, i2_o_o_i = s[compute].split(i2_o_o_i, factor=1)
 i3_o_i, i3_i = s[compute].split(i3, factor=1)
-i3_o_o_i, i3_o_i = s[compute].split(i3_o_i, factor=1)
+i3_o_o_i, i3_o_i = s[compute].split(i3_o_i, factor=7)
 i3_o_o_o, i3_o_o_i = s[compute].split(i3_o_o_i, factor=1)
 s[compute].reorder(i0_o_o_o, i1_o_o_o, i2_o_o_o, i3_o_o_o, i0_o_o_i, i1_o_o_i, i2_o_o_i, i3_o_o_i, i0_o_i, i1_o_i, i2_o_i, i3_o_i, i0_i, i1_i, i2_i, i3_i)
 s[compute].compute_at(s[compute], i3_o_i)
@@ -1540,14 +749,14 @@ s[compute].bind(i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, tvm.threa
 i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(i0_o_i, i1_o_i, i2_o_i, i3_o_i)
 s[compute].bind(i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, tvm.thread_axis(&quot;threadIdx.x&quot;))
 ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(ax0, ax1, ax2, ax3)
-ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
+ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=3)
 s[kernel_shared].vectorize(ax0_ax1_fused_ax2_fused_ax3_fused_i)
-ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
 s[kernel_shared].bind(ax0_ax1_fused_ax2_fused_ax3_fused_o_i, tvm.thread_axis(&quot;threadIdx.x&quot;))
 ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(ax0, ax1, ax2, ax3)
-ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
+ax0_ax1_fused_ax2_fused_ax3_fused_o, ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
 s[pad_temp_shared].vectorize(ax0_ax1_fused_ax2_fused_ax3_fused_i)
-ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=32)
+ax0_ax1_fused_ax2_fused_ax3_fused_o_o, ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
 s[pad_temp_shared].bind(ax0_ax1_fused_ax2_fused_ax3_fused_o_i, tvm.thread_axis(&quot;threadIdx.x&quot;))
 s[compute].pragma(nn_o_o_o_o, &quot;auto_unroll_max_step&quot;, 64)
 s[compute].pragma(nn_o_o_o_o, &quot;unroll_explicit&quot;, True)
@@ -1558,17 +767,17 @@ In this case, we need to create the search policy and cost model by ourselves
 and resume the status of search policy and cost model with the log file.
 In the example below we resume the status and run 5 more trials.</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">log_file</span> <span class="o">=</span> <span class="s2">&quot;conv2d.json&quot;</span>
-<span class="n">cost_model</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">XGBModel</span><span class="p">()</span>
+<span class="n">cost_model</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel" title="View documentation for tvm.auto_scheduler.XGBModel"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">XGBModel</span></a><span class="p">()</span>
 <span class="n">cost_model</span><span class="o">.</span><span class="n">update_from_file</span><span class="p">(</span><span class="n">log_file</span><span class="p">)</span>
-<span class="n">search_policy</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">SketchPolicy</span><span class="p">(</span>
-    <span class="n">task</span><span class="p">,</span> <span class="n">cost_model</span><span class="p">,</span> <span class="n">init_search_callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">PreloadMeasuredStates</span><span class="p">(</span><span class="n">log_file</span><span class="p">)]</span>
+<span class="n">search_policy</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.SketchPolicy" title="View documentation for tvm.auto_scheduler.SketchPolicy"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">SketchPolicy</span></a><span class="p">(</span>
+    <span class="n">task</span><span class="p">,</span> <span class="n">cost_model</span><span class="p">,</span> <span class="n">init_search_callbacks</span><span class="o">=</span><span class="p">[</span><a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.PreloadMeasuredStates" title="View documentation for tvm.auto_scheduler.PreloadMeasuredStates"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">PreloadMeasuredStates</span></a><span class="p">( [...]
 <span class="p">)</span>
-<span class="n">tune_option</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span><span class="p">(</span>
+<span class="n">tune_option</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions" title="View documentation for tvm.auto_scheduler.TuningOptions"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span></a><span class="p">(</span>
     <span class="n">num_measure_trials</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
     <span class="n">runner</span><span class="o">=</span><span class="n">measure_ctx</span><span class="o">.</span><span class="n">runner</span><span class="p">,</span>
-    <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span><span class="p">(</span><span class="n">log_file</span><span class="p">)],</span>
+    <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.RecordToFile" title="View documentation for tvm.auto_scheduler.RecordToFile"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span></a><span class="p">(</span><span class="n">log_file</span><span class="p">)],</span>
 <span class="p">)</span>
-<span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">search_policy</span><span class="p">,</span> [...]
+<span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">search_policy</span><span class="p">,</span> <span  [...]
 
 <span class="c1"># kill the measurement process</span>
 <span class="k">del</span> <span class="n">measure_ctx</span>
@@ -1578,7 +787,7 @@ In the example below we resume the status and run 5 more trials.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  51.342 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  51.937 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_conv2d_layer_cuda.py</span></code></a></p>
diff --git a/docs/tutorials/auto_scheduler/tune_matmul_x86.html b/docs/tutorials/auto_scheduler/tune_matmul_x86.html
index a29c3ce..d016265 100644
--- a/docs/tutorials/auto_scheduler/tune_matmul_x86.html
+++ b/docs/tutorials/auto_scheduler/tune_matmul_x86.html
@@ -246,11 +246,15 @@ From these tensors, the auto-scheduler can get the whole computational graph.</p
 <div class="section" id="create-the-search-task">
 <h2>Create the search task<a class="headerlink" href="#create-the-search-task" title="Permalink to this headline">¶</a></h2>
 <p>We then create a search task with N=L=M=128 and dtype=”float32”
-If your machine supports avx instructions, you can
-- replace “llvm” below with “llvm -mcpu=core-avx2” to enable AVX2
-- replace “llvm” below with “llvm -mcpu=skylake-avx512” to enable AVX-512</p>
+If your machine supports avx instructions, you can</p>
+<blockquote>
+<div><ul class="simple">
+<li><p>replace “llvm” below with “llvm -mcpu=core-avx2” to enable AVX2</p></li>
+<li><p>replace “llvm” below with “llvm -mcpu=skylake-avx512” to enable AVX-512</p></li>
+</ul>
+</div></blockquote>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">target</span> <span class="o">=</span> <a href="../../api/python/target.html#tvm.target.Target" title="View documentation for tvm.target.Target"><span class="n">tvm</span><span class="o">.</span><span class="n">target</span><span class="o">.</span><span class="n">Target</span></a><span class="p">(</span><span class="s2">&quot;llvm&quot;</span><span class="p">)</span>
-<span class="n">task</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">matmul_add</span><span class="p">,</span> <span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="s2">&quot;float32&quot;</span><span class="p">),</span> <span class=" [...]
+<span class="n">task</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.create_task" title="View documentation for tvm.auto_scheduler.create_task"><span class="n">tvm</span><span class="o">.</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">create_task</span></a><span class="p">(</span><span class="n">matmul_add</span><span class="p">,</span> <span class="p">(</span><span class="mi">128</span><span class="p">, [...]
 
 <span class="c1"># Inspect the computational graph</span>
 <span class="nb">print</span><span class="p">(</span><span class="n">task</span><span class="o">.</span><span class="n">compute_dag</span><span class="p">)</span>
@@ -266,16 +270,16 @@ out(i, j) = (matmul[i, j] + C[i, j])
 </div>
 <p>Next, we set parameters for the auto-scheduler.</p>
 <ul class="simple">
-<li><p><cite>num_measure_trials</cite> is the number of measurement trials we can use during the search.
+<li><p><code class="code docutils literal notranslate"><span class="pre">num_measure_trials</span></code> is the number of measurement trials we can use during the search.
 We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a
 good value for the search to converge. You can do more trials according to your time budget.</p></li>
-<li><p>In addition, we use <cite>RecordToFile</cite> to dump measurement records into a file <cite>matmul.json</cite>.
+<li><p>In addition, we use <code class="code docutils literal notranslate"><span class="pre">RecordToFile</span></code> to dump measurement records into a file <cite>matmul.json</cite>.
 The measurement records can be used to query the best result so far, resume the search,
 and do more analyses later.</p></li>
-<li><p>see <a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule.TuningOptions" title="tvm.auto_scheduler.auto_schedule.TuningOptions"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.auto_schedule.TuningOptions</span></code></a>: for more parameters</p></li>
+<li><p>see <a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions" title="tvm.auto_scheduler.TuningOptions"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.TuningOptions</span></code></a> for more parameters</p></li>
 </ul>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">tune_option</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span><span class="p">(</span>
-    <span class="n">num_measure_trials</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span><span class="p">(</span><span class="s2">&quot;matmul.json&quot;</span><span class="p">)]</span>
+<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">tune_option</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions" title="View documentation for tvm.auto_scheduler.TuningOptions"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span></a><span class="p">(</span>
+    <span class="n">num_measure_trials</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.RecordToFile" title="View documentation for tvm.auto_scheduler.RecordToFile"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span></a><span class="p">(</span><span class="s2">& [...]
 <span class="p">)</span>
 </pre></div>
 </div>
@@ -285,11 +289,11 @@ and do more analyses later.</p></li>
 <p>Now we get all inputs ready. Pretty simple, isn’t it?
 We can kick off the search and let the auto-scheduler do its magic.
 After some measurement trials, it will return the best schedule it found.</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p [...]
+<div class="highlight-default notranslate"><div class="highlight"><pre><span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p">,</sp [...]
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>*T*T*T*T*T*T*T*T*T*T
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>*T*T*T*T*T*T*T*T*T
 </pre></div>
 </div>
 <p>We can lower the schedule to see the IR after auto-scheduling.
@@ -302,23 +306,345 @@ parallelization, vectorization, unrolling and operator fusion.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle, out_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {out: Buffer(out_2: Pointer(float32), float32, [128, 128], []),
-             B: Buffer(B_2: Pointer(float32), float32, [128, 128], []),
              C: Buffer(C_2: Pointer(float32), float32, [128, 128], []),
+             B: Buffer(B_2: Pointer(float32), float32, [128, 128], []),
              A: Buffer(A_2: Pointer(float32), float32, [128, 128], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C, out_1: out} {
   attr [matmul: Pointer(float32)] &quot;storage_scope&quot; = &quot;global&quot;;
   allocate(matmul, float32, [16384]) {
-    for (i: int32, 0, 128) {
-      for (j: int32, 0, 128) {
-        matmul[((i*128) + j)] = 0f32
-        for (k: int32, 0, 128) {
-          matmul[((i*128) + j)] = ((float32*)matmul[((i*128) + j)] + ((float32*)A_2[((i*128) + k)]*(float32*)B_2[((k*128) + j)]))
+    for (i.outer.outer.inner: int32, 0, 8) {
+      for (j.outer.outer.inner: int32, 0, 4) {
+        matmul[ramp(((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 128), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 256), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 384), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 512), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 640), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 768), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 896), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 2), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 130), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 258), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 386), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 514), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 642), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 770), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 898), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 4), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 132), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 260), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 388), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 516), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 644), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 772), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 900), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 6), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 134), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 262), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 390), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 518), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 646), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 774), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 902), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 8), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 136), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 264), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 392), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 520), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 648), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 776), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 904), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 10), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 138), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 266), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 394), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 522), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 650), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 778), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 906), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 12), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 140), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 268), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 396), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 524), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 652), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 780), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 908), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 14), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 142), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 270), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 398), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 526), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 654), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 782), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 910), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 16), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 144), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 272), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 400), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 528), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 656), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 784), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 912), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 18), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 146), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 274), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 402), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 530), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 658), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 786), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 914), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 20), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 148), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 276), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 404), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 532), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 660), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 788), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 916), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 22), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 150), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 278), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 406), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 534), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 662), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 790), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 918), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 24), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 152), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 280), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 408), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 536), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 664), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 792), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 920), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 26), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 154), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 282), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 410), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 538), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 666), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 794), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 922), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 28), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 156), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 284), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 412), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 540), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 668), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 796), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 924), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 30), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 158), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 286), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 414), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 542), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 670), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 798), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 926), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1024), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1152), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1280), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1408), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1536), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1664), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1792), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1920), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1026), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1154), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1282), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1410), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1538), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1666), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1794), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1922), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1028), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1156), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1284), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1412), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1540), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1668), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1796), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1924), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1030), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1158), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1286), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1414), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1542), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1670), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1798), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1926), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1032), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1160), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1288), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1416), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1544), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1672), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1800), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1928), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1034), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1162), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1290), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1418), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1546), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1674), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1802), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1930), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1036), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1164), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1292), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1420), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1548), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1676), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1804), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1932), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1038), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1166), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1294), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1422), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1550), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1678), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1806), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1934), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1040), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1168), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1296), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1424), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1552), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1680), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1808), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1936), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1042), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1170), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1298), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1426), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1554), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1682), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1810), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1938), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1044), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1172), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1300), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1428), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1556), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1684), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1812), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1940), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1046), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1174), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1302), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1430), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1558), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1686), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1814), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1942), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1048), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1176), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1304), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1432), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1560), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1688), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1816), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1944), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1050), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1178), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1306), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1434), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1562), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1690), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1818), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1946), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1052), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1180), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1308), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1436), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1564), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1692), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1820), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1948), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1054), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1182), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1310), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1438), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1566), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1694), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1822), 1, 2)] = broadcast(0f32, 2)
+        matmul[ramp((((i.outer.outer.inner*2048) + (j.outer.outer.inner*32)) + 1950), 1, 2)] = broadcast(0f32, 2)
+        for (k.outer: int32, 0, 16) {
+          for (i.outer.inner: int32, 0, 2) {
+            for (j.outer.inner: int32, 0, 16) {
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[(((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8))], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 128)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 256)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 384)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 512)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 640)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 768)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 896)], 2)*(float32x2*)B_2[ramp((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)]))
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 1)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 129)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 257)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 385)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 513)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 641)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 769)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 897)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128 [...]
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 2)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 130)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 258)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 386)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 514)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 642)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 770)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 898)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256 [...]
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 3)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 131)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 259)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 387)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 515)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 643)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 771)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 899)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384 [...]
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 4)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 132)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 260)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 388)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 516)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 644)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 772)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 900)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512 [...]
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 5)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 133)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 261)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 389)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 517)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 645)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 773)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 901)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640 [...]
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 6)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 134)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 262)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 390)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 518)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 646)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 774)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 902)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768 [...]
+              matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] = ((float32x2*)matmul[ramp(((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 7)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)]))
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 128), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 135)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 256), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 263)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 384), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 391)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 512), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 519)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 640), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 647)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 768), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 775)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+              matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] = ((float32x2*)matmul[ramp((((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896), 1, 2)] + (broadcast((float32*)A_2[((((i.outer.outer.inner*2048) + (i.outer.inner*1024)) + (k.outer*8)) + 903)], 2)*(float32x2*)B_2[ramp(((((k.outer*1024) + (j.outer.outer.inner*32)) + (j.outer.inner*2)) + 896 [...]
+            }
+          }
         }
       }
     }
-    for (i_1: int32, 0, 128) {
-      for (j_1: int32, 0, 128) {
-        out_2[((i_1*128) + j_1)] = ((float32*)matmul[((i_1*128) + j_1)] + (float32*)C_2[((i_1*128) + j_1)])
+    for (i.inner: int32, 0, 128) {
+      for (j.inner: int32, 0, 128) {
+        out_2[((i.inner*128) + j.inner)] = ((float32*)matmul[((i.inner*128) + j.inner)] + (float32*)C_2[((i.inner*128) + j.inner)])
       }
     }
   }
@@ -354,7 +680,7 @@ parallelization, vectorization, unrolling and operator fusion.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Execution time of this operator: 2.217 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Execution time of this operator: 0.371 ms
 </pre></div>
 </div>
 </div>
@@ -366,7 +692,7 @@ resume the search, and perform other analyses.</p>
 <p>Here is an example where we load the best schedule from a file,
 print the equivalent Python schedule API, and build the binary again.</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span class="c1"># Load the measurement record for the best schedule</span>
-<span class="n">inp</span><span class="p">,</span> <span class="n">res</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">load_best</span><span class="p">(</span><span class="s2">&quot;matmul.json&quot;</span><span class="p">,</span> <span class="n">task</span><span class="o">.</span><span class="n">workload_key</span><span class="p">)</span>
+<span class="n">inp</span><span class="p">,</span> <span class="n">res</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.load_best" title="View documentation for tvm.auto_scheduler.load_best"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">load_best</span></a><span class="p">(</span><span class="s2">&quot;matmul.json&quot;</span><span class="p">,</span> <span class="n">task</span><span class="o">.</span><span cla [...]
 
 <span class="c1"># Print equivalent Python schedule API. This can be used for debugging and</span>
 <span class="c1"># learning the behavior of the auto-scheduler.</span>
@@ -410,15 +736,15 @@ In this case, we need to create the search policy and cost model by ourselves
 and resume the status of the search policy and cost model with the log file.
 In the example below we resume the status and run 5 more trials.</p>
 <div class="highlight-default notranslate"><div class="highlight"><pre><span class="k">def</span> <span class="nf">resume_search</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">log_file</span><span class="p">):</span>
-    <span class="n">cost_model</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">XGBModel</span><span class="p">()</span>
+    <span class="n">cost_model</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.XGBModel" title="View documentation for tvm.auto_scheduler.XGBModel"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">XGBModel</span></a><span class="p">()</span>
     <span class="n">cost_model</span><span class="o">.</span><span class="n">update_from_file</span><span class="p">(</span><span class="n">log_file</span><span class="p">)</span>
-    <span class="n">search_policy</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">SketchPolicy</span><span class="p">(</span>
-        <span class="n">task</span><span class="p">,</span> <span class="n">cost_model</span><span class="p">,</span> <span class="n">init_search_callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">PreloadMeasuredStates</span><span class="p">(</span><span class="n">log_file</span><span class="p">)]</span>
+    <span class="n">search_policy</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.SketchPolicy" title="View documentation for tvm.auto_scheduler.SketchPolicy"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">SketchPolicy</span></a><span class="p">(</span>
+        <span class="n">task</span><span class="p">,</span> <span class="n">cost_model</span><span class="p">,</span> <span class="n">init_search_callbacks</span><span class="o">=</span><span class="p">[</span><a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.PreloadMeasuredStates" title="View documentation for tvm.auto_scheduler.PreloadMeasuredStates"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">PreloadMeasuredStates</span></a><span class=" [...]
     <span class="p">)</span>
-    <span class="n">tune_option</span> <span class="o">=</span> <span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span><span class="p">(</span>
-        <span class="n">num_measure_trials</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span><span class="p">(</span><span class="n">log_file</span><span class="p">)]</span>
+    <span class="n">tune_option</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.TuningOptions" title="View documentation for tvm.auto_scheduler.TuningOptions"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">TuningOptions</span></a><span class="p">(</span>
+        <span class="n">num_measure_trials</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">measure_callbacks</span><span class="o">=</span><span class="p">[</span><a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.RecordToFile" title="View documentation for tvm.auto_scheduler.RecordToFile"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">RecordToFile</span></a><span class="p">(</span><span class="n" [...]
     <span class="p">)</span>
-    <span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#module-tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">search_policy</span><span class="p">,</s [...]
+    <span class="n">sch</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <a href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.auto_schedule" title="View documentation for tvm.auto_scheduler.auto_schedule"><span class="n">auto_scheduler</span><span class="o">.</span><span class="n">auto_schedule</span></a><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">search_policy</span><span class="p">,</span> <s [...]
 
 
 <span class="c1"># resume_search(task, &quot;matmul.json&quot;)</span>
@@ -438,10 +764,10 @@ There are other workarounds for this problem.
 For example, you can start a new thread/process (with the built-in Python library
 threading or multiprocessing) and run the TVM binaries in the new thread/process.
 This provides isolation and avoids the conflict in the main thread/process.
-You can also use <a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.measure.LocalRPCMeasureContext" title="tvm.auto_scheduler.measure.LocalRPCMeasureContext"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.measure.LocalRPCMeasureContext</span></code></a> for auto-scheduler,
+You can also use <a class="reference internal" href="../../api/python/auto_scheduler.html#tvm.auto_scheduler.LocalRPCMeasureContext" title="tvm.auto_scheduler.LocalRPCMeasureContext"><code class="xref any py py-class docutils literal notranslate"><span class="pre">auto_scheduler.LocalRPCMeasureContext</span></code></a> for auto-scheduler,
 as shown in the GPU tutorial (<a class="reference internal" href="tune_conv2d_layer_cuda.html#auto-scheduler-conv-gpu"><span class="std std-ref">Auto-scheduling a convolution layer for GPU</span></a>).</p>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  56.117 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  46.419 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-auto-scheduler-tune-matmul-x86-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">tune_matmul_x86.py</span></code></a></p>
diff --git a/docs/tutorials/autotvm/sg_execution_times.html b/docs/tutorials/autotvm/sg_execution_times.html
index da75f85..3c9bb12 100644
--- a/docs/tutorials/autotvm/sg_execution_times.html
+++ b/docs/tutorials/autotvm/sg_execution_times.html
@@ -192,14 +192,14 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-autotvm-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>01:17.575</strong> total execution time for <strong>tutorials_autotvm</strong> files:</p>
+<p><strong>01:10.287</strong> total execution time for <strong>tutorials_autotvm</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:49.825</strong>: <a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-tutorials-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></li>
-<li><p><strong>00:27.115</strong>: <a class="reference internal" href="tune_simple_template.html#sphx-glr-tutorials-autotvm-tune-simple-template-py"><span class="std std-ref">Writing tunable template and Using auto-tuner</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_simple_template.py</span></code>)</p></li>
-<li><p><strong>00:00.184</strong>: <a class="reference internal" href="tune_relay_cuda.html#sphx-glr-tutorials-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a convolutional network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></li>
-<li><p><strong>00:00.159</strong>: <a class="reference internal" href="tune_relay_x86.html#sphx-glr-tutorials-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a convolutional network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></li>
-<li><p><strong>00:00.147</strong>: <a class="reference internal" href="tune_relay_arm.html#sphx-glr-tutorials-autotvm-tune-relay-arm-py"><span class="std std-ref">Auto-tuning a convolutional network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_arm.py</span></code>)</p></li>
-<li><p><strong>00:00.144</strong>: <a class="reference internal" href="tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py"><span class="std std-ref">Auto-tuning a convolutional network for Mobile GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_mobile_gpu.py</span></code>)</p></li>
+<li><p><strong>00:45.110</strong>: <a class="reference internal" href="tune_conv2d_cuda.html#sphx-glr-tutorials-autotvm-tune-conv2d-cuda-py"><span class="std std-ref">Tuning High Performance Convolution on NVIDIA GPUs</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_conv2d_cuda.py</span></code>)</p></li>
+<li><p><strong>00:24.539</strong>: <a class="reference internal" href="tune_simple_template.html#sphx-glr-tutorials-autotvm-tune-simple-template-py"><span class="std std-ref">Writing tunable template and Using auto-tuner</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_simple_template.py</span></code>)</p></li>
+<li><p><strong>00:00.186</strong>: <a class="reference internal" href="tune_relay_cuda.html#sphx-glr-tutorials-autotvm-tune-relay-cuda-py"><span class="std std-ref">Auto-tuning a convolutional network for NVIDIA GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_cuda.py</span></code>)</p></li>
+<li><p><strong>00:00.157</strong>: <a class="reference internal" href="tune_relay_x86.html#sphx-glr-tutorials-autotvm-tune-relay-x86-py"><span class="std std-ref">Auto-tuning a convolutional network for x86 CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_x86.py</span></code>)</p></li>
+<li><p><strong>00:00.148</strong>: <a class="reference internal" href="tune_relay_arm.html#sphx-glr-tutorials-autotvm-tune-relay-arm-py"><span class="std std-ref">Auto-tuning a convolutional network for ARM CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_arm.py</span></code>)</p></li>
+<li><p><strong>00:00.146</strong>: <a class="reference internal" href="tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py"><span class="std std-ref">Auto-tuning a convolutional network for Mobile GPU</span></a> (<code class="docutils literal notranslate"><span class="pre">tune_relay_mobile_gpu.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/autotvm/tune_conv2d_cuda.html b/docs/tutorials/autotvm/tune_conv2d_cuda.html
index aa3b464..7ddd5f1 100644
--- a/docs/tutorials/autotvm/tune_conv2d_cuda.html
+++ b/docs/tutorials/autotvm/tune_conv2d_cuda.html
@@ -399,26 +399,26 @@ for this template</p>
    7 unroll_explicit: OtherOption([0, 1]) len=2
 )
 Get devices for measurement successfully!
-No: 1   GFLOPS: 27.74/27.74     result: MeasureResult(costs=(0.00834437025,), error_no=0, all_cost=2.8114962577819824, timestamp=1600569712.3554745)    [(&#39;tile_f&#39;, [-1, 32, 1, 2]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 2, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 1)],None,7166780
-No: 2   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 3   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 4   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 5   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 6   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 7   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 8   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 9   GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 10  GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 11  GFLOPS: 0.00/27.74      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 12  GFLOPS: 46.00/46.00     result: MeasureResult(costs=(0.005032405272727272,), error_no=0, all_cost=2.980912923812866, timestamp=1600569724.2990882)      [(&#39;tile_f&#39;, [-1, 2, 8, 2]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 32]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2077980
-No: 13  GFLOPS: 0.00/46.00      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 14  GFLOPS: 73.67/73.67     result: MeasureResult(costs=(0.00314226628125,), error_no=0, all_cost=1.9068870544433594, timestamp=1600569726.168562)  [(&#39;tile_f&#39;, [-1, 2, 16, 8]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 16, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,8726459
-No: 15  GFLOPS: 27.61/73.67     result: MeasureResult(costs=(0.008385226583333334,), error_no=0, all_cost=1.822760820388794, timestamp=1600569727.4046636)      [(&#39;tile_f&#39;, [-1, 1, 2, 64]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 1, 8]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,5905444
-No: 16  GFLOPS: 1.61/73.67      result: MeasureResult(costs=(0.14341517075,), error_no=0, all_cost=4.74316668510437, timestamp=1600569730.2062669)      [(&#39;tile_f&#39;, [-1, 2, 8, 8]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 2, 4]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 1)],None,7428895
-No: 17  GFLOPS: 0.00/73.67      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (3) /workspace/build/libtvm.so(+0x688a67) [0x7f3a68ee4a67]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f3a68ee4506]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
-No: 18  GFLOPS: 0.00/73.67      result: MeasureResult(costs=(RuntimeError(&#39;Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d5e42) [0x7f3a69a31e42]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f3a69a3343b]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
-No: 19  GFLOPS: 23.75/73.67     result: MeasureResult(costs=(0.00974691818181818,), error_no=0, all_cost=1.496830701828003, timestamp=1600569739.6011925)       [(&#39;tile_f&#39;, [-1, 2, 1, 32]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,782066
-No: 20  GFLOPS: 0.00/73.67      result: MeasureResult(costs=(RuntimeError(&#39;Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f3a699fc061]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d5e42) [0x7f3a69a31e42]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f3a69a3343b]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
+No: 1   GFLOPS: 27.65/27.65     result: MeasureResult(costs=(0.008373972333333334,), error_no=0, all_cost=2.972865104675293, timestamp=1600758792.8104715)      [(&#39;tile_f&#39;, [-1, 32, 1, 2]), (&#39;tile_y&#39;, [-1, 1, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 2, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 1)],None,7166780
+No: 2   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 3   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 4   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 5   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 6   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 7   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 8   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 9   GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 10  GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 11  GFLOPS: 0.00/27.65      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 12  GFLOPS: 49.34/49.34     result: MeasureResult(costs=(0.0046919395,), error_no=0, all_cost=2.9751179218292236, timestamp=1600758802.322572)      [(&#39;tile_f&#39;, [-1, 2, 8, 2]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 1, 32]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 0)],None,2077980
+No: 13  GFLOPS: 0.00/49.34      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 14  GFLOPS: 73.63/73.63     result: MeasureResult(costs=(0.00314397021875,), error_no=0, all_cost=2.0785014629364014, timestamp=1600758803.7681682) [(&#39;tile_f&#39;, [-1, 2, 16, 8]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 16, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,8726459
+No: 15  GFLOPS: 27.59/73.63     result: MeasureResult(costs=(0.008390514333333333,), error_no=0, all_cost=1.9435279369354248, timestamp=1600758804.8366861)     [(&#39;tile_f&#39;, [-1, 1, 2, 64]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 7, 1]), (&#39;tile_rc&#39;, [-1, 1, 8]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 1)],None,5905444
+No: 16  GFLOPS: 1.61/73.63      result: MeasureResult(costs=(0.14342895375,), error_no=0, all_cost=4.801366806030273, timestamp=1600758807.475919)      [(&#39;tile_f&#39;, [-1, 2, 8, 8]), (&#39;tile_y&#39;, [-1, 1, 1, 7]), (&#39;tile_x&#39;, [-1, 7, 1, 1]), (&#39;tile_rc&#39;, [-1, 2, 4]), (&#39;tile_ry&#39;, [-1, 1, 3]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 512), (&#39;unroll_explicit&#39;, 1)],None,7428895
+No: 17  GFLOPS: 0.00/73.63      result: MeasureResult(costs=(InstantiationError(&#39;Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (3) /workspace/build/libtvm.so(+0x688b27) [0x7f23334feb27]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&amp;) const+0x3e6) [0x7f23334fe5c6]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::Prim [...]
+No: 18  GFLOPS: 0.00/73.63      result: MeasureResult(costs=(RuntimeError(&#39;Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d6002) [0x7f233404c002]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f233404d5fb]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
+No: 19  GFLOPS: 23.75/73.63     result: MeasureResult(costs=(0.009745905545454545,), error_no=0, all_cost=1.7060630321502686, timestamp=1600758815.8534865)     [(&#39;tile_f&#39;, [-1, 2, 1, 32]), (&#39;tile_y&#39;, [-1, 1, 7, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 4, 1]), (&#39;tile_ry&#39;, [-1, 3, 1]), (&#39;tile_rx&#39;, [-1, 3, 1]), (&#39;auto_unroll_max_step&#39;, 0), (&#39;unroll_explicit&#39;, 0)],None,782066
+No: 20  GFLOPS: 0.00/73.63      result: MeasureResult(costs=(RuntimeError(&#39;Traceback (most recent call last):\n  [bt] (5) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f2334016221]\n  [bt] (4) /workspace/build/libtvm.so(+0x11d6002) [0x7f233404c002]\n  [bt] (3) /workspace/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x26b) [0x7f233404d5fb]\n  [bt] (2) /workspace/build/libtvm.so(tvm::runtime::RPCClientSession::Call [...]
 </pre></div>
 </div>
 <p>Finally, we can inspect the best config from the log file, check correctness,
@@ -457,7 +457,7 @@ and measure running time.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Best config:
 [(&#39;tile_f&#39;, [-1, 2, 16, 8]), (&#39;tile_y&#39;, [-1, 7, 1, 1]), (&#39;tile_x&#39;, [-1, 1, 1, 1]), (&#39;tile_rc&#39;, [-1, 16, 1]), (&#39;tile_ry&#39;, [-1, 1, 1]), (&#39;tile_rx&#39;, [-1, 1, 1]), (&#39;auto_unroll_max_step&#39;, 1500), (&#39;unroll_explicit&#39;, 1)],None,8726459
-Time cost of this operator: 0.003474
+Time cost of this operator: 0.003496
 </pre></div>
 </div>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-autotvm-tune-conv2d-cuda-py">
diff --git a/docs/tutorials/autotvm/tune_simple_template.html b/docs/tutorials/autotvm/tune_simple_template.html
index 9ed50b5..220f350 100644
--- a/docs/tutorials/autotvm/tune_simple_template.html
+++ b/docs/tutorials/autotvm/tune_simple_template.html
@@ -492,16 +492,16 @@ used to get the best config later.</p>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Get devices for measurement successfully!
-No: 1   GFLOPS: 9.96/9.96       result: MeasureResult(costs=(0.0269546244,), error_no=0, all_cost=0.927971601486206, timestamp=1600569684.2742639)      [(&#39;tile_y&#39;, [-1, 8]), (&#39;tile_x&#39;, [-1, 32])],None,53
-No: 2   GFLOPS: 12.65/12.65     result: MeasureResult(costs=(0.0212222168,), error_no=0, all_cost=1.3642914295196533, timestamp=1600569685.455414)      [(&#39;tile_y&#39;, [-1, 128]), (&#39;tile_x&#39;, [-1, 256])],None,87
-No: 3   GFLOPS: 15.22/15.22     result: MeasureResult(costs=(0.017637051,), error_no=0, all_cost=1.055131435394287, timestamp=1600569686.5537252)       [(&#39;tile_y&#39;, [-1, 8]), (&#39;tile_x&#39;, [-1, 512])],None,93
-No: 4   GFLOPS: 13.09/15.22     result: MeasureResult(costs=(0.0205073562,), error_no=0, all_cost=1.1743948459625244, timestamp=1600569687.7479143)     [(&#39;tile_y&#39;, [-1, 128]), (&#39;tile_x&#39;, [-1, 512])],None,97
-No: 5   GFLOPS: 2.02/15.22      result: MeasureResult(costs=(0.1327765914,), error_no=0, all_cost=2.800631523132324, timestamp=1600569690.720167)       [(&#39;tile_y&#39;, [-1, 256]), (&#39;tile_x&#39;, [-1, 4])],None,28
-No: 6   GFLOPS: 8.86/15.22      result: MeasureResult(costs=(0.030290215199999998,), error_no=0, all_cost=1.296706199645996, timestamp=1600569692.0395942)      [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 32])],None,52
-No: 7   GFLOPS: 13.83/15.22     result: MeasureResult(costs=(0.0194060902,), error_no=0, all_cost=0.917316198348999, timestamp=1600569693.2027879)      [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 512])],None,91
-No: 8   GFLOPS: 11.73/15.22     result: MeasureResult(costs=(0.022882502800000003,), error_no=0, all_cost=1.1356632709503174, timestamp=1600569694.3774312)     [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 256])],None,81
-No: 9   GFLOPS: 0.92/15.22      result: MeasureResult(costs=(0.291634,), error_no=0, all_cost=5.390047550201416, timestamp=1600569701.2650788)  [(&#39;tile_y&#39;, [-1, 128]), (&#39;tile_x&#39;, [-1, 2])],None,17
-No: 10  GFLOPS: 1.19/15.22      result: MeasureResult(costs=(0.22557004639999997,), error_no=0, all_cost=4.4250712394714355, timestamp=1600569705.7239919)      [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 2])],None,10
+No: 1   GFLOPS: 9.77/9.77       result: MeasureResult(costs=(0.0274771002,), error_no=0, all_cost=1.644925594329834, timestamp=1600758767.2463927)      [(&#39;tile_y&#39;, [-1, 8]), (&#39;tile_x&#39;, [-1, 32])],None,53
+No: 2   GFLOPS: 12.61/12.61     result: MeasureResult(costs=(0.021286801799999998,), error_no=0, all_cost=0.8907041549682617, timestamp=1600758768.201701)      [(&#39;tile_y&#39;, [-1, 128]), (&#39;tile_x&#39;, [-1, 256])],None,87
+No: 3   GFLOPS: 15.62/15.62     result: MeasureResult(costs=(0.0171844216,), error_no=0, all_cost=1.0766642093658447, timestamp=1600758769.0675318)     [(&#39;tile_y&#39;, [-1, 8]), (&#39;tile_x&#39;, [-1, 512])],None,93
+No: 4   GFLOPS: 13.08/15.62     result: MeasureResult(costs=(0.0205201322,), error_no=0, all_cost=1.000988483428955, timestamp=1600758769.9721282)      [(&#39;tile_y&#39;, [-1, 128]), (&#39;tile_x&#39;, [-1, 512])],None,97
+No: 5   GFLOPS: 1.99/15.62      result: MeasureResult(costs=(0.13515661699999998,), error_no=0, all_cost=3.0966920852661133, timestamp=1600758772.747954)       [(&#39;tile_y&#39;, [-1, 256]), (&#39;tile_x&#39;, [-1, 4])],None,28
+No: 6   GFLOPS: 8.94/15.62      result: MeasureResult(costs=(0.030038061999999997,), error_no=0, all_cost=1.2703056335449219, timestamp=1600758773.7996676)     [(&#39;tile_y&#39;, [-1, 4]), (&#39;tile_x&#39;, [-1, 32])],None,52
+No: 7   GFLOPS: 13.82/15.62     result: MeasureResult(costs=(0.0194217882,), error_no=0, all_cost=1.1055903434753418, timestamp=1600758774.7236273)     [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 512])],None,91
+No: 8   GFLOPS: 12.04/15.62     result: MeasureResult(costs=(0.022300372800000003,), error_no=0, all_cost=1.1945393085479736, timestamp=1600758775.6967962)     [(&#39;tile_y&#39;, [-1, 2]), (&#39;tile_x&#39;, [-1, 256])],None,81
+No: 9   GFLOPS: 0.92/15.62      result: MeasureResult(costs=(0.2910877202,), error_no=0, all_cost=5.471558094024658, timestamp=1600758782.301078)       [(&#39;tile_y&#39;, [-1, 128]), (&#39;tile_x&#39;, [-1, 2])],None,17
+No: 10  GFLOPS: 1.21/15.62      result: MeasureResult(costs=(0.221818162,), error_no=0, all_cost=4.466007947921753, timestamp=1600758786.428074)        [(&#39;tile_y&#39;, [-1, 1]), (&#39;tile_x&#39;, [-1, 2])],None,10
 </pre></div>
 </div>
 <p>Finally we apply history best from the cache file and check its correctness.
diff --git a/docs/tutorials/dev/sg_execution_times.html b/docs/tutorials/dev/sg_execution_times.html
index e9d1435..a708f8b 100644
--- a/docs/tutorials/dev/sg_execution_times.html
+++ b/docs/tutorials/dev/sg_execution_times.html
@@ -192,10 +192,10 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-dev-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:00.556</strong> total execution time for <strong>tutorials_dev</strong> files:</p>
+<p><strong>00:00.566</strong> total execution time for <strong>tutorials_dev</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:00.371</strong>: <a class="reference internal" href="use_pass_infra.html#sphx-glr-tutorials-dev-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></li>
-<li><p><strong>00:00.186</strong>: <a class="reference internal" href="low_level_custom_pass.html#sphx-glr-tutorials-dev-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></li>
+<li><p><strong>00:00.376</strong>: <a class="reference internal" href="use_pass_infra.html#sphx-glr-tutorials-dev-use-pass-infra-py"><span class="std std-ref">How to Use TVM Pass Infra</span></a> (<code class="docutils literal notranslate"><span class="pre">use_pass_infra.py</span></code>)</p></li>
+<li><p><strong>00:00.189</strong>: <a class="reference internal" href="low_level_custom_pass.html#sphx-glr-tutorials-dev-low-level-custom-pass-py"><span class="std std-ref">Writing a Customized Pass</span></a> (<code class="docutils literal notranslate"><span class="pre">low_level_custom_pass.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/frontend/deploy_model_on_android.html b/docs/tutorials/frontend/deploy_model_on_android.html
index 0379ac6..9fe98f5 100644
--- a/docs/tutorials/frontend/deploy_model_on_android.html
+++ b/docs/tutorials/frontend/deploy_model_on_android.html
@@ -533,7 +533,7 @@ to the remote android device.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>TVM prediction top-1: tiger cat
 Evaluate inference time cost...
-Mean inference time (std dev): 13.37 ms (0.12 ms)
+Mean inference time (std dev): 14.01 ms (1.59 ms)
 </pre></div>
 </div>
 </div>
diff --git a/docs/tutorials/frontend/deploy_object_detection_pytorch.html b/docs/tutorials/frontend/deploy_object_detection_pytorch.html
index 36946b2..e6bd036 100644
--- a/docs/tutorials/frontend/deploy_object_detection_pytorch.html
+++ b/docs/tutorials/frontend/deploy_object_detection_pytorch.html
@@ -377,7 +377,7 @@ torchvision rcnn models.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Get 9 valid boxes
 </pre></div>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  47.919 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  48.525 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-frontend-deploy-object-detection-pytorch-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/ec94e7a109437cf90cddcc60a7b5aaea/deploy_object_detection_pytorch.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_object_detection_pytorch.py</span></code></a></p>
diff --git a/docs/tutorials/frontend/deploy_prequantized.html b/docs/tutorials/frontend/deploy_prequantized.html
index d1c8187..62b20a3 100644
--- a/docs/tutorials/frontend/deploy_prequantized.html
+++ b/docs/tutorials/frontend/deploy_prequantized.html
@@ -433,7 +433,7 @@ output values are identical out of 1000 outputs from mobilenet v2.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Elapsed average ms: 19.295815570000002
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Elapsed average ms: 19.28641578
 </pre></div>
 </div>
 <div class="admonition note">
diff --git a/docs/tutorials/frontend/deploy_prequantized_tflite.html b/docs/tutorials/frontend/deploy_prequantized_tflite.html
index fade91a..c13f599 100644
--- a/docs/tutorials/frontend/deploy_prequantized_tflite.html
+++ b/docs/tutorials/frontend/deploy_prequantized_tflite.html
@@ -445,7 +445,7 @@ TFLite Top-5 labels: [387 102 386 341 880]
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Elapsed average ms: 36.139228329999995
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Elapsed average ms: 36.101913149999994
 </pre></div>
 </div>
 <div class="admonition note">
@@ -472,7 +472,7 @@ device and follow <a class="reference external" href="https://tvm.apache.org/doc
 </ul>
 </div></blockquote>
 </div>
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  38.399 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 2 minutes  37.247 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-frontend-deploy-prequantized-tflite-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/5c443f88ea44ce77c5ccade429af6e74/deploy_prequantized_tflite.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_prequantized_tflite.py</span></code></a></p>
diff --git a/docs/tutorials/frontend/deploy_ssd_gluoncv.html b/docs/tutorials/frontend/deploy_ssd_gluoncv.html
index d303971..2b6d61c 100644
--- a/docs/tutorials/frontend/deploy_ssd_gluoncv.html
+++ b/docs/tutorials/frontend/deploy_ssd_gluoncv.html
@@ -453,7 +453,7 @@ Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_
 </pre></div>
 </div>
 <img alt="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" class="sphx-glr-single-img" src="../../_images/sphx_glr_deploy_ssd_gluoncv_001.png" />
-<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  54.180 seconds)</p>
+<p class="sphx-glr-timing"><strong>Total running time of the script:</strong> ( 1 minutes  55.271 seconds)</p>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-frontend-deploy-ssd-gluoncv-py">
 <div class="sphx-glr-download docutils container">
 <p><a class="reference download internal" download="" href="../../_downloads/ca08de6c440df207921d807474d26f06/deploy_ssd_gluoncv.py"><code class="xref download docutils literal notranslate"><span class="pre">Download</span> <span class="pre">Python</span> <span class="pre">source</span> <span class="pre">code:</span> <span class="pre">deploy_ssd_gluoncv.py</span></code></a></p>
diff --git a/docs/tutorials/frontend/from_onnx.html b/docs/tutorials/frontend/from_onnx.html
index ece92ff..5becb54 100644
--- a/docs/tutorials/frontend/from_onnx.html
+++ b/docs/tutorials/frontend/from_onnx.html
@@ -313,9 +313,9 @@ we skip the pytorch model construction part, and download the saved onnx model</
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>...47%, 0.01 MB, 38 KB/s, 0 seconds passed
-...94%, 0.02 MB, 74 KB/s, 0 seconds passed
-...100%, 0.02 MB, 111 KB/s, 0 seconds passed
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>...47%, 0.01 MB, 214 KB/s, 0 seconds passed
+...94%, 0.02 MB, 361 KB/s, 0 seconds passed
+...100%, 0.02 MB, 539 KB/s, 0 seconds passed
 Cannot find config for target=llvm -keys=cpu, workload=(&#39;conv2d_NCHWc.x86&#39;, (&#39;TENSOR&#39;, (1, 32, 224, 224), &#39;float32&#39;), (&#39;TENSOR&#39;, (9, 32, 3, 3), &#39;float32&#39;), (1, 1), (1, 1, 1, 1), (1, 1), &#39;NCHW&#39;, &#39;NCHW&#39;, &#39;float32&#39;). A fallback configuration is used, which may bring great performance regression.
 Cannot find config for target=llvm -keys=cpu, workload=(&#39;conv2d_NCHWc.x86&#39;, (&#39;TENSOR&#39;, (1, 64, 224, 224), &#39;float32&#39;), (&#39;TENSOR&#39;, (32, 64, 3, 3), &#39;float32&#39;), (1, 1), (1, 1, 1, 1), (1, 1), &#39;NCHW&#39;, &#39;NCHW&#39;, &#39;float32&#39;). A fallback configuration is used, which may bring great performance regression.
 Cannot find config for target=llvm -keys=cpu, workload=(&#39;conv2d_NCHWc.x86&#39;, (&#39;TENSOR&#39;, (1, 1, 224, 224), &#39;float32&#39;), (&#39;TENSOR&#39;, (64, 1, 5, 5), &#39;float32&#39;), (1, 1), (2, 2, 2, 2), (1, 1), &#39;NCHW&#39;, &#39;NCHW&#39;, &#39;float32&#39;). A fallback configuration is used, which may bring great performance regression.
diff --git a/docs/tutorials/frontend/sg_execution_times.html b/docs/tutorials/frontend/sg_execution_times.html
index 1049037..39be385 100644
--- a/docs/tutorials/frontend/sg_execution_times.html
+++ b/docs/tutorials/frontend/sg_execution_times.html
@@ -192,27 +192,27 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-frontend-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>10:35.452</strong> total execution time for <strong>tutorials_frontend</strong> files:</p>
+<p><strong>10:36.775</strong> total execution time for <strong>tutorials_frontend</strong> files:</p>
 <ul class="simple">
-<li><p><strong>02:38.399</strong>: <a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-tutorials-frontend-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></li>
-<li><p><strong>01:54.180</strong>: <a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-tutorials-frontend-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></li>
-<li><p><strong>01:47.919</strong>: <a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-tutorials-frontend-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></li>
-<li><p><strong>00:38.947</strong>: <a class="reference internal" href="deploy_prequantized.html#sphx-glr-tutorials-frontend-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></li>
-<li><p><strong>00:37.533</strong>: <a class="reference internal" href="from_tensorflow.html#sphx-glr-tutorials-frontend-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></li>
-<li><p><strong>00:31.184</strong>: <a class="reference internal" href="deploy_quantized.html#sphx-glr-tutorials-frontend-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></li>
-<li><p><strong>00:26.233</strong>: <a class="reference internal" href="from_tflite.html#sphx-glr-tutorials-frontend-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></li>
-<li><p><strong>00:22.877</strong>: <a class="reference internal" href="from_darknet.html#sphx-glr-tutorials-frontend-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></li>
-<li><p><strong>00:16.766</strong>: <a class="reference internal" href="from_caffe2.html#sphx-glr-tutorials-frontend-from-caffe2-py"><span class="std std-ref">Compile Caffe2 Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_caffe2.py</span></code>)</p></li>
-<li><p><strong>00:15.025</strong>: <a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-tutorials-frontend-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></li>
-<li><p><strong>00:14.010</strong>: <a class="reference internal" href="deploy_model_on_android.html#sphx-glr-tutorials-frontend-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></li>
-<li><p><strong>00:11.676</strong>: <a class="reference internal" href="from_keras.html#sphx-glr-tutorials-frontend-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></li>
-<li><p><strong>00:11.624</strong>: <a class="reference internal" href="from_pytorch.html#sphx-glr-tutorials-frontend-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></li>
-<li><p><strong>00:09.566</strong>: <a class="reference internal" href="from_coreml.html#sphx-glr-tutorials-frontend-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></li>
-<li><p><strong>00:08.809</strong>: <a class="reference internal" href="from_mxnet.html#sphx-glr-tutorials-frontend-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></li>
-<li><p><strong>00:05.507</strong>: <a class="reference internal" href="build_gcn.html#sphx-glr-tutorials-frontend-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></li>
-<li><p><strong>00:03.012</strong>: <a class="reference internal" href="using_external_lib.html#sphx-glr-tutorials-frontend-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></li>
-<li><p><strong>00:02.012</strong>: <a class="reference internal" href="from_onnx.html#sphx-glr-tutorials-frontend-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></li>
-<li><p><strong>00:00.173</strong>: <a class="reference internal" href="deploy_sparse.html#sphx-glr-tutorials-frontend-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></li>
+<li><p><strong>02:37.247</strong>: <a class="reference internal" href="deploy_prequantized_tflite.html#sphx-glr-tutorials-frontend-deploy-prequantized-tflite-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized_tflite.py</span></code>)</p></li>
+<li><p><strong>01:55.271</strong>: <a class="reference internal" href="deploy_ssd_gluoncv.html#sphx-glr-tutorials-frontend-deploy-ssd-gluoncv-py"><span class="std std-ref">Deploy Single Shot Multibox Detector(SSD) model</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_ssd_gluoncv.py</span></code>)</p></li>
+<li><p><strong>01:48.525</strong>: <a class="reference internal" href="deploy_object_detection_pytorch.html#sphx-glr-tutorials-frontend-deploy-object-detection-pytorch-py"><span class="std std-ref">Compile PyTorch Object Detection Models</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_object_detection_pytorch.py</span></code>)</p></li>
+<li><p><strong>00:39.296</strong>: <a class="reference internal" href="deploy_prequantized.html#sphx-glr-tutorials-frontend-deploy-prequantized-py"><span class="std std-ref">Deploy a Framework-prequantized Model with TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_prequantized.py</span></code>)</p></li>
+<li><p><strong>00:37.281</strong>: <a class="reference internal" href="from_tensorflow.html#sphx-glr-tutorials-frontend-from-tensorflow-py"><span class="std std-ref">Compile Tensorflow Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tensorflow.py</span></code>)</p></li>
+<li><p><strong>00:31.319</strong>: <a class="reference internal" href="deploy_quantized.html#sphx-glr-tutorials-frontend-deploy-quantized-py"><span class="std std-ref">Deploy a Quantized Model on Cuda</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_quantized.py</span></code>)</p></li>
+<li><p><strong>00:26.102</strong>: <a class="reference internal" href="from_tflite.html#sphx-glr-tutorials-frontend-from-tflite-py"><span class="std std-ref">Compile TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_tflite.py</span></code>)</p></li>
+<li><p><strong>00:22.942</strong>: <a class="reference internal" href="from_darknet.html#sphx-glr-tutorials-frontend-from-darknet-py"><span class="std std-ref">Compile YOLO-V2 and YOLO-V3 in DarkNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_darknet.py</span></code>)</p></li>
+<li><p><strong>00:16.702</strong>: <a class="reference internal" href="from_caffe2.html#sphx-glr-tutorials-frontend-from-caffe2-py"><span class="std std-ref">Compile Caffe2 Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_caffe2.py</span></code>)</p></li>
+<li><p><strong>00:15.132</strong>: <a class="reference internal" href="deploy_model_on_rasp.html#sphx-glr-tutorials-frontend-deploy-model-on-rasp-py"><span class="std std-ref">Deploy the Pretrained Model on Raspberry Pi</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_rasp.py</span></code>)</p></li>
+<li><p><strong>00:14.016</strong>: <a class="reference internal" href="deploy_model_on_android.html#sphx-glr-tutorials-frontend-deploy-model-on-android-py"><span class="std std-ref">Deploy the Pretrained Model on Android</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_model_on_android.py</span></code>)</p></li>
+<li><p><strong>00:11.667</strong>: <a class="reference internal" href="from_pytorch.html#sphx-glr-tutorials-frontend-from-pytorch-py"><span class="std std-ref">Compile PyTorch Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_pytorch.py</span></code>)</p></li>
+<li><p><strong>00:11.593</strong>: <a class="reference internal" href="from_keras.html#sphx-glr-tutorials-frontend-from-keras-py"><span class="std std-ref">Compile Keras Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_keras.py</span></code>)</p></li>
+<li><p><strong>00:09.698</strong>: <a class="reference internal" href="from_mxnet.html#sphx-glr-tutorials-frontend-from-mxnet-py"><span class="std std-ref">Compile MXNet Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_mxnet.py</span></code>)</p></li>
+<li><p><strong>00:09.579</strong>: <a class="reference internal" href="from_coreml.html#sphx-glr-tutorials-frontend-from-coreml-py"><span class="std std-ref">Compile CoreML Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_coreml.py</span></code>)</p></li>
+<li><p><strong>00:05.577</strong>: <a class="reference internal" href="build_gcn.html#sphx-glr-tutorials-frontend-build-gcn-py"><span class="std std-ref">Building a Graph Convolutional Network</span></a> (<code class="docutils literal notranslate"><span class="pre">build_gcn.py</span></code>)</p></li>
+<li><p><strong>00:02.951</strong>: <a class="reference internal" href="using_external_lib.html#sphx-glr-tutorials-frontend-using-external-lib-py"><span class="std std-ref">Using External Libraries in Relay</span></a> (<code class="docutils literal notranslate"><span class="pre">using_external_lib.py</span></code>)</p></li>
+<li><p><strong>00:01.709</strong>: <a class="reference internal" href="from_onnx.html#sphx-glr-tutorials-frontend-from-onnx-py"><span class="std std-ref">Compile ONNX Models</span></a> (<code class="docutils literal notranslate"><span class="pre">from_onnx.py</span></code>)</p></li>
+<li><p><strong>00:00.168</strong>: <a class="reference internal" href="deploy_sparse.html#sphx-glr-tutorials-frontend-deploy-sparse-py"><span class="std std-ref">Deploy a Hugging Face Pruned Model on CPU</span></a> (<code class="docutils literal notranslate"><span class="pre">deploy_sparse.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/get_started/cross_compilation_and_rpc.html b/docs/tutorials/get_started/cross_compilation_and_rpc.html
index 247b3d9..585a0c6 100644
--- a/docs/tutorials/get_started/cross_compilation_and_rpc.html
+++ b/docs/tutorials/get_started/cross_compilation_and_rpc.html
@@ -378,7 +378,7 @@ device and returns the measured cost. Network overhead is excluded.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>1.179e-07 secs/op
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>1.186e-07 secs/op
 </pre></div>
 </div>
 </div>
diff --git a/docs/tutorials/get_started/relay_quick_start.html b/docs/tutorials/get_started/relay_quick_start.html
index d10a07f..bc4f30a 100644
--- a/docs/tutorials/get_started/relay_quick_start.html
+++ b/docs/tutorials/get_started/relay_quick_start.html
@@ -381,57 +381,57 @@ in this example. Then the machine code will be generated as the module library.<
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>...1%, 0.01 MB, 37 KB/s, 0 seconds passed
-...3%, 0.02 MB, 72 KB/s, 0 seconds passed
-...5%, 0.02 MB, 108 KB/s, 0 seconds passed
-...7%, 0.03 MB, 143 KB/s, 0 seconds passed
-...9%, 0.04 MB, 178 KB/s, 0 seconds passed
-...11%, 0.05 MB, 208 KB/s, 0 seconds passed
-...13%, 0.05 MB, 241 KB/s, 0 seconds passed
-...15%, 0.06 MB, 275 KB/s, 0 seconds passed
-...17%, 0.07 MB, 308 KB/s, 0 seconds passed
-...19%, 0.08 MB, 341 KB/s, 0 seconds passed
-...21%, 0.09 MB, 374 KB/s, 0 seconds passed
-...23%, 0.09 MB, 400 KB/s, 0 seconds passed
-...25%, 0.10 MB, 434 KB/s, 0 seconds passed
-...27%, 0.11 MB, 467 KB/s, 0 seconds passed
-...29%, 0.12 MB, 499 KB/s, 0 seconds passed
-...31%, 0.12 MB, 530 KB/s, 0 seconds passed
-...33%, 0.13 MB, 563 KB/s, 0 seconds passed
-...35%, 0.14 MB, 594 KB/s, 0 seconds passed
-...37%, 0.15 MB, 627 KB/s, 0 seconds passed
-...39%, 0.16 MB, 657 KB/s, 0 seconds passed
-...41%, 0.16 MB, 690 KB/s, 0 seconds passed
-...43%, 0.17 MB, 720 KB/s, 0 seconds passed
-...45%, 0.18 MB, 751 KB/s, 0 seconds passed
-...47%, 0.19 MB, 783 KB/s, 0 seconds passed
-...49%, 0.20 MB, 805 KB/s, 0 seconds passed
-...51%, 0.20 MB, 835 KB/s, 0 seconds passed
-...53%, 0.21 MB, 862 KB/s, 0 seconds passed
-...55%, 0.22 MB, 894 KB/s, 0 seconds passed
-...57%, 0.23 MB, 922 KB/s, 0 seconds passed
-...59%, 0.23 MB, 954 KB/s, 0 seconds passed
-...61%, 0.24 MB, 979 KB/s, 0 seconds passed
-...63%, 0.25 MB, 1011 KB/s, 0 seconds passed
-...65%, 0.26 MB, 1039 KB/s, 0 seconds passed
-...67%, 0.27 MB, 1070 KB/s, 0 seconds passed
-...69%, 0.27 MB, 1097 KB/s, 0 seconds passed
-...71%, 0.28 MB, 1128 KB/s, 0 seconds passed
-...73%, 0.29 MB, 1150 KB/s, 0 seconds passed
-...75%, 0.30 MB, 1180 KB/s, 0 seconds passed
-...77%, 0.30 MB, 1205 KB/s, 0 seconds passed
-...79%, 0.31 MB, 1235 KB/s, 0 seconds passed
-...81%, 0.32 MB, 1262 KB/s, 0 seconds passed
-...83%, 0.33 MB, 1293 KB/s, 0 seconds passed
-...85%, 0.34 MB, 1317 KB/s, 0 seconds passed
-...87%, 0.34 MB, 1347 KB/s, 0 seconds passed
-...89%, 0.35 MB, 1373 KB/s, 0 seconds passed
-...91%, 0.36 MB, 1403 KB/s, 0 seconds passed
-...93%, 0.37 MB, 1426 KB/s, 0 seconds passed
-...95%, 0.38 MB, 1456 KB/s, 0 seconds passed
-...97%, 0.38 MB, 1481 KB/s, 0 seconds passed
-...99%, 0.39 MB, 1511 KB/s, 0 seconds passed
-...100%, 0.40 MB, 1540 KB/s, 0 seconds passed
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>...1%, 0.01 MB, 180 KB/s, 0 seconds passed
+...3%, 0.02 MB, 303 KB/s, 0 seconds passed
+...5%, 0.02 MB, 454 KB/s, 0 seconds passed
+...7%, 0.03 MB, 600 KB/s, 0 seconds passed
+...9%, 0.04 MB, 728 KB/s, 0 seconds passed
+...11%, 0.05 MB, 775 KB/s, 0 seconds passed
+...13%, 0.05 MB, 892 KB/s, 0 seconds passed
+...15%, 0.06 MB, 1006 KB/s, 0 seconds passed
+...17%, 0.07 MB, 1119 KB/s, 0 seconds passed
+...19%, 0.08 MB, 1241 KB/s, 0 seconds passed
+...21%, 0.09 MB, 1349 KB/s, 0 seconds passed
+...23%, 0.09 MB, 1365 KB/s, 0 seconds passed
+...25%, 0.10 MB, 1476 KB/s, 0 seconds passed
+...27%, 0.11 MB, 1582 KB/s, 0 seconds passed
+...29%, 0.12 MB, 1676 KB/s, 0 seconds passed
+...31%, 0.12 MB, 1785 KB/s, 0 seconds passed
+...33%, 0.13 MB, 1892 KB/s, 0 seconds passed
+...35%, 0.14 MB, 1986 KB/s, 0 seconds passed
+...37%, 0.15 MB, 2077 KB/s, 0 seconds passed
+...39%, 0.16 MB, 2183 KB/s, 0 seconds passed
+...41%, 0.16 MB, 2265 KB/s, 0 seconds passed
+...43%, 0.17 MB, 2369 KB/s, 0 seconds passed
+...45%, 0.18 MB, 2474 KB/s, 0 seconds passed
+...47%, 0.19 MB, 2559 KB/s, 0 seconds passed
+...49%, 0.20 MB, 2661 KB/s, 0 seconds passed
+...51%, 0.20 MB, 2628 KB/s, 0 seconds passed
+...53%, 0.21 MB, 2713 KB/s, 0 seconds passed
+...55%, 0.22 MB, 2811 KB/s, 0 seconds passed
+...57%, 0.23 MB, 2882 KB/s, 0 seconds passed
+...59%, 0.23 MB, 2978 KB/s, 0 seconds passed
+...61%, 0.24 MB, 3075 KB/s, 0 seconds passed
+...63%, 0.25 MB, 3172 KB/s, 0 seconds passed
+...65%, 0.26 MB, 3244 KB/s, 0 seconds passed
+...67%, 0.27 MB, 3339 KB/s, 0 seconds passed
+...69%, 0.27 MB, 3433 KB/s, 0 seconds passed
+...71%, 0.28 MB, 3529 KB/s, 0 seconds passed
+...73%, 0.29 MB, 3600 KB/s, 0 seconds passed
+...75%, 0.30 MB, 3693 KB/s, 0 seconds passed
+...77%, 0.30 MB, 3786 KB/s, 0 seconds passed
+...79%, 0.31 MB, 3880 KB/s, 0 seconds passed
+...81%, 0.32 MB, 3950 KB/s, 0 seconds passed
+...83%, 0.33 MB, 4042 KB/s, 0 seconds passed
+...85%, 0.34 MB, 4134 KB/s, 0 seconds passed
+...87%, 0.34 MB, 4227 KB/s, 0 seconds passed
+...89%, 0.35 MB, 4286 KB/s, 0 seconds passed
+...91%, 0.36 MB, 4378 KB/s, 0 seconds passed
+...93%, 0.37 MB, 4468 KB/s, 0 seconds passed
+...95%, 0.38 MB, 4559 KB/s, 0 seconds passed
+...97%, 0.38 MB, 4624 KB/s, 0 seconds passed
+...99%, 0.39 MB, 4713 KB/s, 0 seconds passed
+...100%, 0.40 MB, 4798 KB/s, 0 seconds passed
 Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=(&#39;dense_small_batch.cuda&#39;, (&#39;TENSOR&#39;, (1, 512), &#39;float32&#39;), (&#39;TENSOR&#39;, (1000, 512), &#39;float32&#39;), None, &#39;float32&#39;). A fallback configuration is used, which may bring great performance regression.
 </pre></div>
 </div>
diff --git a/docs/tutorials/get_started/sg_execution_times.html b/docs/tutorials/get_started/sg_execution_times.html
index d68a36f..a5d753c 100644
--- a/docs/tutorials/get_started/sg_execution_times.html
+++ b/docs/tutorials/get_started/sg_execution_times.html
@@ -192,11 +192,11 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-get-started-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:16.580</strong> total execution time for <strong>tutorials_get_started</strong> files:</p>
+<p><strong>00:16.361</strong> total execution time for <strong>tutorials_get_started</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:16.096</strong>: <a class="reference internal" href="relay_quick_start.html#sphx-glr-tutorials-get-started-relay-quick-start-py"><span class="std std-ref">Quick Start Tutorial for Compiling Deep Learning Models</span></a> (<code class="docutils literal notranslate"><span class="pre">relay_quick_start.py</span></code>)</p></li>
+<li><p><strong>00:15.875</strong>: <a class="reference internal" href="relay_quick_start.html#sphx-glr-tutorials-get-started-relay-quick-start-py"><span class="std std-ref">Quick Start Tutorial for Compiling Deep Learning Models</span></a> (<code class="docutils literal notranslate"><span class="pre">relay_quick_start.py</span></code>)</p></li>
 <li><p><strong>00:00.350</strong>: <a class="reference internal" href="tensor_expr_get_started.html#sphx-glr-tutorials-get-started-tensor-expr-get-started-py"><span class="std std-ref">Get Started with Tensor Expression</span></a> (<code class="docutils literal notranslate"><span class="pre">tensor_expr_get_started.py</span></code>)</p></li>
-<li><p><strong>00:00.134</strong>: <a class="reference internal" href="cross_compilation_and_rpc.html#sphx-glr-tutorials-get-started-cross-compilation-and-rpc-py"><span class="std std-ref">Cross Compilation and RPC</span></a> (<code class="docutils literal notranslate"><span class="pre">cross_compilation_and_rpc.py</span></code>)</p></li>
+<li><p><strong>00:00.136</strong>: <a class="reference internal" href="cross_compilation_and_rpc.html#sphx-glr-tutorials-get-started-cross-compilation-and-rpc-py"><span class="std std-ref">Cross Compilation and RPC</span></a> (<code class="docutils literal notranslate"><span class="pre">cross_compilation_and_rpc.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/language/schedule_primitives.html b/docs/tutorials/language/schedule_primitives.html
index 68ffe98..8d408be 100644
--- a/docs/tutorials/language/schedule_primitives.html
+++ b/docs/tutorials/language/schedule_primitives.html
@@ -261,13 +261,13 @@ schedule computes tensor in a serial manner in a row-major order.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=&quot;auto&quot;),
-             B: Buffer(B_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
+  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=&quot;auto&quot;),
+             C: Buffer(C_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
              A: Buffer(A_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type=&quot;auto&quot;)}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   for (i: int32, 0, m) {
     for (j: int32, 0, n) {
-      C_2[((i*stride) + (j*stride_1))] = ((float32*)A_2[((i*stride_4) + (j*stride_5))]*(float32*)B_2[((i*stride_2) + (j*stride_3))])
+      C_2[((i*stride_2) + (j*stride_3))] = ((float32*)A_2[((i*stride_4) + (j*stride_5))]*(float32*)B_2[((i*stride) + (j*stride_1))])
     }
   }
 }
@@ -509,13 +509,13 @@ of computation of <cite>C</cite>.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32], [stride: int32], type=&quot;auto&quot;),
-             B: Buffer(B_2: Pointer(float32), float32, [m], [stride_1: int32], type=&quot;auto&quot;),
+  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type=&quot;auto&quot;),
+             C: Buffer(C_2: Pointer(float32), float32, [m], [stride_1: int32], type=&quot;auto&quot;),
              A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type=&quot;auto&quot;)}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   for (i: int32, 0, m) {
-    B_2[(i*stride_1)] = ((float32*)A_2[(i*stride_2)] + 1f32)
-    C_2[(i*stride)] = ((float32*)B_2[(i*stride_1)]*2f32)
+    B_2[(i*stride)] = ((float32*)A_2[(i*stride_2)] + 1f32)
+    C_2[(i*stride_1)] = ((float32*)B_2[(i*stride)]*2f32)
   }
 }
 </pre></div>
@@ -538,12 +538,12 @@ tensor is required.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type=&quot;auto&quot;),
-             C: Buffer(C_2: Pointer(float32), float32, [m], [stride_1: int32], type=&quot;auto&quot;),
+  buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32], [stride: int32], type=&quot;auto&quot;),
+             B: Buffer(B_2: Pointer(float32), float32, [m], [stride_1: int32], type=&quot;auto&quot;),
              A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type=&quot;auto&quot;)}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   for (i: int32, 0, m) {
-    C_2[(i*stride_1)] = (((float32*)A_2[(i*stride_2)] + 1f32)*2f32)
+    C_2[(i*stride)] = (((float32*)A_2[(i*stride_2)] + 1f32)*2f32)
   }
 }
 </pre></div>
diff --git a/docs/tutorials/language/sg_execution_times.html b/docs/tutorials/language/sg_execution_times.html
index 82b1454..ef2ab4c 100644
--- a/docs/tutorials/language/sg_execution_times.html
+++ b/docs/tutorials/language/sg_execution_times.html
@@ -192,16 +192,16 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-language-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:04.698</strong> total execution time for <strong>tutorials_language</strong> files:</p>
+<p><strong>00:04.710</strong> total execution time for <strong>tutorials_language</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:01.809</strong>: <a class="reference internal" href="intrin_math.html#sphx-glr-tutorials-language-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></li>
-<li><p><strong>00:00.864</strong>: <a class="reference internal" href="tensorize.html#sphx-glr-tutorials-language-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></li>
+<li><p><strong>00:01.825</strong>: <a class="reference internal" href="intrin_math.html#sphx-glr-tutorials-language-intrin-math-py"><span class="std std-ref">Intrinsics and Math Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">intrin_math.py</span></code>)</p></li>
+<li><p><strong>00:00.870</strong>: <a class="reference internal" href="tensorize.html#sphx-glr-tutorials-language-tensorize-py"><span class="std std-ref">Use Tensorize to Leverage Hardware Intrinsics</span></a> (<code class="docutils literal notranslate"><span class="pre">tensorize.py</span></code>)</p></li>
 <li><p><strong>00:00.624</strong>: <a class="reference internal" href="scan.html#sphx-glr-tutorials-language-scan-py"><span class="std std-ref">Scan and Recurrent Kernel</span></a> (<code class="docutils literal notranslate"><span class="pre">scan.py</span></code>)</p></li>
 <li><p><strong>00:00.597</strong>: <a class="reference internal" href="reduction.html#sphx-glr-tutorials-language-reduction-py"><span class="std std-ref">Reduction</span></a> (<code class="docutils literal notranslate"><span class="pre">reduction.py</span></code>)</p></li>
 <li><p><strong>00:00.252</strong>: <a class="reference internal" href="extern_op.html#sphx-glr-tutorials-language-extern-op-py"><span class="std std-ref">External Tensor Functions</span></a> (<code class="docutils literal notranslate"><span class="pre">extern_op.py</span></code>)</p></li>
-<li><p><strong>00:00.212</strong>: <a class="reference internal" href="schedule_primitives.html#sphx-glr-tutorials-language-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></li>
-<li><p><strong>00:00.186</strong>: <a class="reference internal" href="tedd.html#sphx-glr-tutorials-language-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></li>
-<li><p><strong>00:00.154</strong>: <a class="reference internal" href="tuple_inputs.html#sphx-glr-tutorials-language-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></li>
+<li><p><strong>00:00.211</strong>: <a class="reference internal" href="schedule_primitives.html#sphx-glr-tutorials-language-schedule-primitives-py"><span class="std std-ref">Schedule Primitives in TVM</span></a> (<code class="docutils literal notranslate"><span class="pre">schedule_primitives.py</span></code>)</p></li>
+<li><p><strong>00:00.181</strong>: <a class="reference internal" href="tedd.html#sphx-glr-tutorials-language-tedd-py"><span class="std std-ref">Use Tensor Expression Debug Display (TEDD) for Visualization</span></a> (<code class="docutils literal notranslate"><span class="pre">tedd.py</span></code>)</p></li>
+<li><p><strong>00:00.150</strong>: <a class="reference internal" href="tuple_inputs.html#sphx-glr-tutorials-language-tuple-inputs-py"><span class="std std-ref">Compute and Reduce with Tuple Inputs</span></a> (<code class="docutils literal notranslate"><span class="pre">tuple_inputs.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/language/tensorize.html b/docs/tutorials/language/tensorize.html
index 9689479..ec0c6ad 100644
--- a/docs/tutorials/language/tensorize.html
+++ b/docs/tutorials/language/tensorize.html
@@ -284,8 +284,8 @@ Thus we break down the matmul loops to make the innermost loops a (16x64) GEMV.<
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
-             C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
+  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 512], []),
+             B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
              A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   for (i: int32, 0, 1024) {
@@ -421,8 +421,8 @@ The importing needs to happen before the tensorized GEMV being executed.</p>
              B: Buffer(B_2: Pointer(float32), float32, [512, 64], []),
              A: Buffer(A_2: Pointer(float32), float32, [1024, 64], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
-  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmpp_vwn3cy/input0.cc&#39;
-source_filename = &quot;/tmp/tmpp_vwn3cy/input0.cc&quot;
+  attr [IterVar(i: int32, (nullptr), &quot;DataPar&quot;, &quot;&quot;)] &quot;pragma_import_llvm&quot; = &quot;; ModuleID = &#39;/tmp/tmp4nmf4z66/input0.cc&#39;
+source_filename = &quot;/tmp/tmp4nmf4z66/input0.cc&quot;
 target datalayout = &quot;e-m:e-i64:64-f80:128-n8:16:32:64-S128&quot;
 target triple = &quot;x86_64-pc-linux-gnu&quot;
 
diff --git a/docs/tutorials/language/tuple_inputs.html b/docs/tutorials/language/tuple_inputs.html
index 7b22473..fe69c1e 100644
--- a/docs/tutorials/language/tuple_inputs.html
+++ b/docs/tutorials/language/tuple_inputs.html
@@ -248,14 +248,14 @@ together in the next schedule procedure.</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A0_1: handle, A1_1: handle, B.v0_1: handle, B.v1_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {B.v1: Buffer(B.v1_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=&quot;auto&quot;),
-             B.v0: Buffer(B.v0_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
-             A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type=&quot;auto&quot;),
+             A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
+             B.v0: Buffer(B.v0_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type=&quot;auto&quot;),
              A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_6: int32, stride_7: int32], type=&quot;auto&quot;)}
   buffer_map = {A0_1: A0, A1_1: A1, B.v0_1: B.v0, B.v1_1: B.v1} {
   for (i: int32, 0, m) {
     for (j: int32, 0, n) {
-      B.v0_2[((i*stride_2) + (j*stride_3))] = ((float32*)A0_2[((i*stride_6) + (j*stride_7))] + 2f32)
-      B.v1_2[((i*stride) + (j*stride_1))] = ((float32*)A1_2[((i*stride_4) + (j*stride_5))]*3f32)
+      B.v0_2[((i*stride_4) + (j*stride_5))] = ((float32*)A0_2[((i*stride_6) + (j*stride_7))] + 2f32)
+      B.v1_2[((i*stride) + (j*stride_1))] = ((float32*)A1_2[((i*stride_2) + (j*stride_3))]*3f32)
     }
   }
 }
@@ -302,16 +302,16 @@ with <code class="xref py py-func docutils literal notranslate"><span class="pre
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(idx_1: handle, val_1: handle, T.v0_1: handle, T.v1_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
   buffers = {T.v0: Buffer(T.v0_2: Pointer(int32), int32, [m: int32], [stride: int32], type=&quot;auto&quot;),
-             val: Buffer(val_2: Pointer(int32), int32, [m, n: int32], [stride_1: int32, stride_2: int32], type=&quot;auto&quot;),
-             T.v1: Buffer(T.v1_2: Pointer(int32), int32, [m], [stride_3: int32], type=&quot;auto&quot;),
+             T.v1: Buffer(T.v1_2: Pointer(int32), int32, [m], [stride_1: int32], type=&quot;auto&quot;),
+             val: Buffer(val_2: Pointer(int32), int32, [m, n: int32], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
              idx: Buffer(idx_2: Pointer(int32), int32, [m, n], [stride_4: int32, stride_5: int32], type=&quot;auto&quot;)}
   buffer_map = {idx_1: idx, val_1: val, T.v0_1: T.v0, T.v1_1: T.v1} {
   for (i: int32, 0, m) {
     T.v0_2[(i*stride)] = -1
-    T.v1_2[(i*stride_3)] = -2147483648
+    T.v1_2[(i*stride_1)] = -2147483648
     for (k: int32, 0, n) {
-      T.v0_2[(i*stride)] = @tir.if_then_else(((int32*)val_2[((i*stride_1) + (k*stride_2))] &lt;= (int32*)T.v1_2[(i*stride_3)]), (int32*)T.v0_2[(i*stride)], (int32*)idx_2[((i*stride_4) + (k*stride_5))], dtype=int32)
-      T.v1_2[(i*stride_3)] = @tir.if_then_else(((int32*)val_2[((i*stride_1) + (k*stride_2))] &lt;= (int32*)T.v1_2[(i*stride_3)]), (int32*)T.v1_2[(i*stride_3)], (int32*)val_2[((i*stride_1) + (k*stride_2))], dtype=int32)
+      T.v0_2[(i*stride)] = @tir.if_then_else(((int32*)val_2[((i*stride_2) + (k*stride_3))] &lt;= (int32*)T.v1_2[(i*stride_1)]), (int32*)T.v0_2[(i*stride)], (int32*)idx_2[((i*stride_4) + (k*stride_5))], dtype=int32)
+      T.v1_2[(i*stride_1)] = @tir.if_then_else(((int32*)val_2[((i*stride_2) + (k*stride_3))] &lt;= (int32*)T.v1_2[(i*stride_1)]), (int32*)T.v1_2[(i*stride_1)], (int32*)val_2[((i*stride_2) + (k*stride_3))], dtype=int32)
     }
   }
 }
@@ -344,8 +344,8 @@ in terms of operation.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A0_1: handle, A1_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {A1: Buffer(A1_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=&quot;auto&quot;),
-             C: Buffer(C_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
+  buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type=&quot;auto&quot;),
+             A1: Buffer(A1_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type=&quot;auto&quot;),
              A0: Buffer(A0_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type=&quot;auto&quot;)}
   buffer_map = {A0_1: A0, A1_1: A1, C_1: C} {
   attr [B.v0: Pointer(float32)] &quot;storage_scope&quot; = &quot;global&quot;;
@@ -358,7 +358,7 @@ in terms of operation.</p>
       B.v1[j] = ((float32*)A0_2[((i*stride_4) + (j*stride_5))]*3f32)
     }
     for (j_1: int32, 0, n) {
-      C_2[((i*stride_2) + (j_1*stride_3))] = ((float32*)A1_2[((i*stride) + (j_1*stride_1))] + (float32*)B.v0[j_1])
+      C_2[((i*stride) + (j_1*stride_1))] = ((float32*)A1_2[((i*stride_2) + (j_1*stride_3))] + (float32*)B.v0[j_1])
     }
   }
 }
diff --git a/docs/tutorials/micro/sg_execution_times.html b/docs/tutorials/micro/sg_execution_times.html
index e461195..fd7d8b2 100644
--- a/docs/tutorials/micro/sg_execution_times.html
+++ b/docs/tutorials/micro/sg_execution_times.html
@@ -192,9 +192,9 @@
             
   <div class="section" id="computation-times">
 <span id="sphx-glr-tutorials-micro-sg-execution-times"></span><h1>Computation times<a class="headerlink" href="#computation-times" title="Permalink to this headline">¶</a></h1>
-<p><strong>00:09.958</strong> total execution time for <strong>tutorials_micro</strong> files:</p>
+<p><strong>00:09.992</strong> total execution time for <strong>tutorials_micro</strong> files:</p>
 <ul class="simple">
-<li><p><strong>00:09.958</strong>: <a class="reference internal" href="micro_tflite.html#sphx-glr-tutorials-micro-micro-tflite-py"><span class="std std-ref">Micro TVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></li>
+<li><p><strong>00:09.992</strong>: <a class="reference internal" href="micro_tflite.html#sphx-glr-tutorials-micro-micro-tflite-py"><span class="std std-ref">Micro TVM with TFLite Models</span></a> (<code class="docutils literal notranslate"><span class="pre">micro_tflite.py</span></code>)</p></li>
 </ul>
 </div>
 
diff --git a/docs/tutorials/optimize/opt_conv_cuda.html b/docs/tutorials/optimize/opt_conv_cuda.html
index 590b308..319cabe 100644
--- a/docs/tutorials/optimize/opt_conv_cuda.html
+++ b/docs/tutorials/optimize/opt_conv_cuda.html
@@ -411,7 +411,7 @@ latency of convolution.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Convolution: 53.235596 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Convolution: 53.272718 ms
 </pre></div>
 </div>
 <div class="sphx-glr-footer class sphx-glr-footer-example docutils container" id="sphx-glr-download-tutorials-optimize-opt-conv-cuda-py">
diff --git a/docs/tutorials/optimize/opt_conv_tensorcore.html b/docs/tutorials/optimize/opt_conv_tensorcore.html
index 597ec4a..7afc7f0 100644
--- a/docs/tutorials/optimize/opt_conv_tensorcore.html
+++ b/docs/tutorials/optimize/opt_conv_tensorcore.html
@@ -561,8 +561,8 @@ The only thing we should do is to make sure all threads in a warp can call Tenso
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, W_1: handle, Conv_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
-             Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
+  buffers = {Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
+             W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
              A: Buffer(A_2: Pointer(float16), float16, [16, 14, 14, 16, 16, 16], [])}
   buffer_map = {A_1: A, W_1: W, Conv_1: Conv} {
   attr [IterVar(blockIdx.z: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.z&quot;)] &quot;thread_extent&quot; = 196;
@@ -664,8 +664,8 @@ by mapping the 2D convolution to tensor intrinsics</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, W_1: handle, Conv_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
-             W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
+  buffers = {W: Buffer(W_2: Pointer(float16), float16, [3, 3, 16, 32, 16, 16], []),
+             Conv: Buffer(Conv_2: Pointer(float32), float32, [16, 14, 14, 32, 16, 16], []),
              A: Buffer(A_2: Pointer(float16), float16, [16, 14, 14, 16, 16, 16], [])}
   buffer_map = {A_1: A, W_1: W, Conv_1: Conv} {
   attr [IterVar(blockIdx.z: int32, (nullptr), &quot;ThreadIndex&quot;, &quot;blockIdx.z&quot;)] &quot;thread_extent&quot; = 196;
@@ -750,7 +750,7 @@ be able to run on our build server</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>conv2d with tensor core: 11.952753 ms
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>conv2d with tensor core: 13.384932 ms
 </pre></div>
 </div>
 </div>
diff --git a/docs/tutorials/optimize/opt_gemm.html b/docs/tutorials/optimize/opt_gemm.html
index af367e3..f5363d2 100644
--- a/docs/tutorials/optimize/opt_gemm.html
+++ b/docs/tutorials/optimize/opt_gemm.html
@@ -308,8 +308,8 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Numpy running time: 0.008755
-Baseline: 3.522159
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Numpy running time: 0.012972
+Baseline: 3.358347
 </pre></div>
 </div>
 <p>In TVM, we can always inspect lower level IR to debug or optimize our schedule.
@@ -367,7 +367,7 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt1: 0.290302
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt1: 0.288556
 </pre></div>
 </div>
 <p>Here is the generated IR after blocking.</p>
@@ -431,7 +431,7 @@ we can use <cite>vectorize</cite> interface to hint the compiler this pattern, s
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt2: 0.323975
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt2: 0.323788
 </pre></div>
 </div>
 <p>Here is the generated IR after vectorization.</p>
@@ -441,8 +441,8 @@ we can use <cite>vectorize</cite> interface to hint the compiler this pattern, s
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
-             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+  buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
              A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   for (x.outer: int32, 0, 32) {
@@ -491,7 +491,7 @@ the access pattern for A matrix is more cache friendly.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt3: 0.112113
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt3: 0.112968
 </pre></div>
 </div>
 <p>Here is the generated IR after loop permutation.</p>
@@ -501,8 +501,8 @@ the access pattern for A matrix is more cache friendly.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
-             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
+  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
+             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
              A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   for (x.outer: int32, 0, 32) {
@@ -567,7 +567,7 @@ the corresponding value from the packed array.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt4: 0.106036
+<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt4: 0.106808
 </pre></div>
 </div>
 <p>Here is the generated IR after array packing.</p>
@@ -577,8 +577,8 @@ the corresponding value from the packed array.</p>
 <p class="sphx-glr-script-out">Out:</p>
 <div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>primfn(A_1: handle, B_1: handle, C_1: handle) -&gt; ()
   attr = {&quot;global_symbol&quot;: &quot;main&quot;, &quot;tir.noalias&quot;: True}
-  buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
-             B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+  buffers = {B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], []),
+             C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
              A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], [])}
   buffer_map = {A_1: A, B_1: B, C_1: C} {
   attr [packedB: Pointer(float32)] &quot;storage_scope&quot; = &quot;global&quot;;
@@ -647,7 +647,7 @@ write to C when all the block results are ready.</p>
 </pre></div>
 </div>
 <p class="sphx-glr-script-out">Out:</p>
-<div class="sphx-glr-script-out highlight-none notranslate"><div class="highlight"><pre>Opt5: 0.097966
... 421 lines suppressed ...