Posted to commits@tvm.apache.org by tq...@apache.org on 2022/06/30 19:29:15 UTC

[tvm-site] branch asf-site updated: deploying docs (apache/tvm@265030eea4cf0447b5744b759d763158373167a2)

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new a71bd255d deploying docs (apache/tvm@265030eea4cf0447b5744b759d763158373167a2)
a71bd255d is described below

commit a71bd255d833bed4761a2e3f34aa9583d0f10461
Author: tvm-bot <95...@users.noreply.github.com>
AuthorDate: Thu Jun 30 19:29:07 2022 +0000

    deploying docs (apache/tvm@265030eea4cf0447b5744b759d763158373167a2)
---
 .../tune_relay_cuda.py                             |    6 +
 .../067cf39a44d9f315a39f8a7547c556d8/install.py    |    6 +
 .../tune_sparse_x86.py                             |    6 +
 .../deploy_sparse.ipynb                            |   11 +
 .../0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py |    6 +
 .../opt_gemm.ipynb                                 |   11 +
 .../tvmc_python.py                                 |    6 +
 .../reduction.ipynb                                |    2 +-
 .../from_paddle.py                                 |    6 +
 .../tune_network_arm.py                            |    6 +
 .../deploy_prequantized_tflite.ipynb               |   11 +
 .../intrin_math.ipynb                              |    2 +-
 .../deploy_model_on_android.py                     |    6 +
 .../tvmc_command_line_driver.py                    |    6 +
 .../from_tflite.ipynb                              |   11 +
 .../286e7f77f494a25312ac88e3f234822e/extern_op.py  |    6 +
 .../2a0982f8ca0176cb17713d28286536e4/reduction.py  |    6 +
 .../2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb |   11 +
 .../autotvm_relay_x86.ipynb                        |   11 +
 .../micro_tflite.py                                |    6 +
 .../introduction.py                                |    6 +
 .../tune_relay_arm.py                              |    6 +
 .../autotvm_matmul_x86.ipynb                       |   11 +
 .../3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py |    6 +
 .../from_coreml.py                                 |    6 +
 .../tensorize.ipynb                                |    2 +-
 .../opt_conv_cuda.py                               |    6 +
 .../relay_quick_start.ipynb                        |   11 +
 .../deploy_model_on_rasp.py                        |    6 +
 .../tensor_expr_get_started.py                     |    6 +
 .../428c6201e29ce74e73c6b41eee589f62/tensorize.py  |    6 +
 .../tensor_expr_get_started.ipynb                  |   11 +
 .../use_pass_infra.py                              |    6 +
 .../micro_ethosu.ipynb                             |   11 +
 .../deploy_prequantized_tflite.py                  |    6 +
 .../autotvm_relay_x86.py                           |    6 +
 .../micro_tflite.ipynb                             |   11 +
 .../tensor_ir_blitz_course.py                      |    6 +
 .../tune_relay_mobile_gpu.py                       |    6 +
 .../tune_network_mali.py                           |    6 +
 .../tune_relay_x86.py                              |    6 +
 .../tuple_inputs.py                                |    6 +
 .../tune_conv2d_cuda.py                            |    6 +
 .../tune_relay_mobile_gpu.ipynb                    |   11 +
 .../729378592a96230b4f7be71b44da43a4/scan.ipynb    |    2 +-
 .../tune_conv2d_cuda.ipynb                         |   11 +
 .../opt_conv_tensorcore.py                         |    6 +
 .../opt_conv_tensorcore.ipynb                      |   11 +
 .../cross_compilation_and_rpc.py                   |    6 +
 .../from_darknet.py                                |    6 +
 .../deploy_object_detection_pytorch.py             |    6 +
 .../deploy_quantized.py                            |    6 +
 .../micro_reference_vm.py                          |    6 +
 .../from_tensorflow.py                             |    6 +
 .../build_gcn.ipynb                                |    2 +-
 .../extern_op.ipynb                                |    2 +-
 .../opt_conv_cuda.ipynb                            |   11 +
 .../8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py       |    6 +
 .../tvmc_python.ipynb                              |   11 +
 .../autotvm_matmul_x86.py                          |    6 +
 .../96137df89d8034b548f407123ec50ce9/opt_gemm.py   |    6 +
 .../deploy_sparse.py                               |    6 +
 .../micro_autotune.py                              |    6 +
 .../introduction.ipynb                             |   11 +
 .../tuple_inputs.ipynb                             |    2 +-
 .../from_tflite.py                                 |    6 +
 .../micro_ethosu.py                                |    6 +
 .../bring_your_own_datatypes.ipynb                 |   11 +
 .../schedule_primitives.ipynb                      |    2 +-
 .../tune_relay_arm.ipynb                           |   11 +
 .../deploy_prequantized.ipynb                      |   11 +
 .../c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py |    6 +
 .../c253040abc62eace272e406b7e1a4df5/tedd.py       |    6 +
 .../low_level_custom_pass.py                       |    6 +
 .../using_relay_viz.py                             |    6 +
 .../relay_quick_start.py                           |    6 +
 .../deploy_ssd_gluoncv.py                          |    6 +
 .../use_pass_instrument.py                         |    6 +
 .../tune_relay_cuda.ipynb                          |   11 +
 .../using_external_lib.py                          |    6 +
 .../intrin_math.py                                 |    8 +-
 .../schedule_primitives.py                         |    6 +
 .../dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py  |    6 +
 .../tune_conv2d_layer_cuda.py                      |    6 +
 .../tune_network_x86.py                            |    6 +
 .../auto_scheduler_matmul_x86.py                   |    6 +
 .../tune_network_cuda.py                           |    6 +
 .../eb551cfff8900ec35fae9f15aa728e45/from_onnx.py  |    6 +
 .../bring_your_own_datatypes.py                    |    6 +
 .../tvmc_command_line_driver.ipynb                 |   11 +
 .../cross_compilation_and_rpc.ipynb                |   11 +
 .../from_oneflow.py                                |    6 +
 .../from_pytorch.py                                |    6 +
 .../deploy_prequantized.py                         |    6 +
 .../how_to/compile_models/from_coreml.rst.txt      |   23 +-
 .../how_to/compile_models/from_darknet.rst.txt     |   28 +-
 .../how_to/compile_models/from_keras.rst.txt       |   23 +-
 .../how_to/compile_models/from_mxnet.rst.txt       |   29 +-
 .../how_to/compile_models/from_oneflow.rst.txt     |   29 +-
 .../how_to/compile_models/from_onnx.rst.txt        |   29 +-
 .../how_to/compile_models/from_paddle.rst.txt      |   28 +-
 .../how_to/compile_models/from_pytorch.rst.txt     |   29 +-
 .../how_to/compile_models/from_tensorflow.rst.txt  |   44 +-
 .../how_to/compile_models/from_tflite.rst.txt      |   36 +-
 .../compile_models/sg_execution_times.rst.txt      |   22 +-
 .../deploy_models/deploy_model_on_android.rst.txt  |   37 +-
 .../deploy_models/deploy_model_on_rasp.rst.txt     |   37 +-
 .../deploy_object_detection_pytorch.rst.txt        |   31 +-
 .../deploy_models/deploy_prequantized.rst.txt      |   75 +-
 .../deploy_prequantized_tflite.rst.txt             |   81 +-
 .../how_to/deploy_models/deploy_quantized.rst.txt  |   25 +-
 .../how_to/deploy_models/deploy_sparse.rst.txt     |   45 +-
 .../deploy_models/deploy_ssd_gluoncv.rst.txt       |   27 +-
 .../deploy_models/sg_execution_times.rst.txt       |   16 +-
 .../extend_tvm/bring_your_own_datatypes.rst.txt    |   83 +-
 .../extend_tvm/low_level_custom_pass.rst.txt       |   23 +-
 .../how_to/extend_tvm/sg_execution_times.rst.txt   |   10 +-
 .../how_to/extend_tvm/use_pass_infra.rst.txt       |   45 +-
 .../how_to/extend_tvm/use_pass_instrument.rst.txt  |   83 +-
 .../optimize_operators/opt_conv_cuda.rst.txt       |   39 +-
 .../optimize_operators/opt_conv_tensorcore.rst.txt |   43 +-
 .../how_to/optimize_operators/opt_gemm.rst.txt     |   89 +-
 .../optimize_operators/sg_execution_times.rst.txt  |    8 +-
 .../sg_execution_times.rst.txt                     |   14 +-
 .../tune_conv2d_layer_cuda.rst.txt                 | 1387 ++++----------------
 .../tune_network_arm.rst.txt                       |   29 +-
 .../tune_network_cuda.rst.txt                      |   27 +-
 .../tune_network_mali.rst.txt                      |   27 +-
 .../tune_network_x86.rst.txt                       |   29 +-
 .../tune_sparse_x86.rst.txt                        |  174 ++-
 .../tune_with_autotvm/sg_execution_times.rst.txt   |   10 +-
 .../tune_with_autotvm/tune_conv2d_cuda.rst.txt     |   63 +-
 .../tune_with_autotvm/tune_relay_arm.rst.txt       |   43 +-
 .../tune_with_autotvm/tune_relay_cuda.rst.txt      |   45 +-
 .../tune_relay_mobile_gpu.rst.txt                  |   45 +-
 .../tune_with_autotvm/tune_relay_x86.rst.txt       |   17 +-
 .../work_with_microtvm/micro_autotune.rst.txt      |   47 +-
 .../how_to/work_with_microtvm/micro_ethosu.rst.txt |   51 +-
 .../how_to/work_with_microtvm/micro_tflite.rst.txt |   41 +-
 .../how_to/work_with_microtvm/micro_train.rst.txt  |   16 +-
 .../work_with_microtvm/sg_execution_times.rst.txt  |   14 +-
 .../how_to/work_with_relay/build_gcn.rst.txt       |   31 +-
 .../work_with_relay/sg_execution_times.rst.txt     |    6 +-
 .../work_with_relay/using_external_lib.rst.txt     |   25 +-
 .../how_to/work_with_relay/using_relay_viz.rst.txt |   25 +-
 .../how_to/work_with_schedules/extern_op.rst.txt   |   21 +-
 .../how_to/work_with_schedules/intrin_math.rst.txt |   32 +-
 .../how_to/work_with_schedules/reduction.rst.txt   |   47 +-
 .../how_to/work_with_schedules/scan.rst.txt        |   29 +-
 .../schedule_primitives.rst.txt                    |   55 +-
 .../work_with_schedules/sg_execution_times.rst.txt |   16 +-
 .../how_to/work_with_schedules/tedd.rst.txt        |   29 +-
 .../how_to/work_with_schedules/tensorize.rst.txt   |   47 +-
 .../work_with_schedules/tuple_inputs.rst.txt       |   19 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |    4 +-
 .../frontend/deploy_classification.rst.txt         |    2 +-
 .../tutorials/frontend/deploy_detection.rst.txt    |    2 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |    6 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |    6 +-
 .../topic/vta/tutorials/sg_execution_times.rst.txt |    6 +-
 .../tutorial/auto_scheduler_matmul_x86.rst.txt     |   51 +-
 docs/_sources/tutorial/autotvm_matmul_x86.rst.txt  |   75 +-
 docs/_sources/tutorial/autotvm_relay_x86.rst.txt   |  137 +-
 .../tutorial/cross_compilation_and_rpc.rst.txt     |   45 +-
 docs/_sources/tutorial/install.rst.txt             |   17 +-
 docs/_sources/tutorial/intro_topi.rst.txt          |   45 +-
 docs/_sources/tutorial/introduction.rst.txt        |   15 +-
 docs/_sources/tutorial/relay_quick_start.rst.txt   |   35 +-
 docs/_sources/tutorial/sg_execution_times.rst.txt  |   26 +-
 .../tutorial/tensor_expr_get_started.rst.txt       |  183 +--
 .../tutorial/tensor_ir_blitz_course.rst.txt        |   41 +-
 .../tutorial/tvmc_command_line_driver.rst.txt      |   41 +-
 docs/_sources/tutorial/tvmc_python.rst.txt         |   41 +-
 docs/commit_hash                                   |    2 +-
 docs/how_to/compile_models/from_darknet.html       |    1 -
 docs/how_to/compile_models/from_mxnet.html         |    2 +-
 docs/how_to/compile_models/from_oneflow.html       |   57 +-
 docs/how_to/compile_models/from_onnx.html          |    4 +-
 docs/how_to/compile_models/from_paddle.html        |    1 -
 docs/how_to/compile_models/from_pytorch.html       |    7 +-
 docs/how_to/compile_models/from_tensorflow.html    |    1 -
 docs/how_to/compile_models/from_tflite.html        |    3 +
 docs/how_to/compile_models/sg_execution_times.html |   38 +-
 .../deploy_models/deploy_model_on_android.html     |    2 +-
 .../deploy_object_detection_pytorch.html           |   26 +-
 docs/how_to/deploy_models/deploy_prequantized.html |   15 +-
 .../deploy_models/deploy_prequantized_tflite.html  |    7 +-
 docs/how_to/deploy_models/deploy_quantized.html    |    2 +-
 docs/how_to/deploy_models/deploy_sparse.html       |    3 +
 docs/how_to/deploy_models/deploy_ssd_gluoncv.html  |   36 +-
 docs/how_to/deploy_models/sg_execution_times.html  |   24 +-
 .../extend_tvm/bring_your_own_datatypes.html       |    5 +-
 docs/how_to/extend_tvm/sg_execution_times.html     |   10 +-
 docs/how_to/extend_tvm/use_pass_instrument.html    |   16 +-
 docs/how_to/optimize_operators/opt_conv_cuda.html  |    5 +-
 .../optimize_operators/opt_conv_tensorcore.html    |    5 +-
 docs/how_to/optimize_operators/opt_gemm.html       |   19 +-
 .../optimize_operators/sg_execution_times.html     |    8 +-
 .../sg_execution_times.html                        |   14 +-
 .../tune_conv2d_layer_cuda.html                    | 1350 ++++---------------
 .../tune_with_autoscheduler/tune_network_cuda.html |    2 +-
 .../tune_with_autoscheduler/tune_network_x86.html  |    4 +-
 .../tune_with_autoscheduler/tune_sparse_x86.html   |  137 +-
 .../tune_with_autotvm/sg_execution_times.html      |   10 +-
 .../how_to/tune_with_autotvm/tune_conv2d_cuda.html |   37 +-
 docs/how_to/tune_with_autotvm/tune_relay_arm.html  |    3 +
 docs/how_to/tune_with_autotvm/tune_relay_cuda.html |    3 +
 .../tune_with_autotvm/tune_relay_mobile_gpu.html   |    3 +
 docs/how_to/work_with_microtvm/micro_autotune.html |   16 +-
 docs/how_to/work_with_microtvm/micro_ethosu.html   |    3 +
 docs/how_to/work_with_microtvm/micro_tflite.html   |    3 +
 docs/how_to/work_with_microtvm/micro_train.html    |   16 +-
 .../work_with_microtvm/sg_execution_times.html     |   16 +-
 docs/how_to/work_with_relay/build_gcn.html         |    1 +
 .../how_to/work_with_relay/sg_execution_times.html |    6 +-
 docs/how_to/work_with_schedules/extern_op.html     |    1 +
 docs/how_to/work_with_schedules/intrin_math.html   |    4 +-
 docs/how_to/work_with_schedules/reduction.html     |    1 +
 docs/how_to/work_with_schedules/scan.html          |    1 +
 .../work_with_schedules/schedule_primitives.html   |    1 +
 .../work_with_schedules/sg_execution_times.html    |   16 +-
 docs/how_to/work_with_schedules/tensorize.html     |    3 +-
 docs/how_to/work_with_schedules/tuple_inputs.html  |    1 +
 docs/reference/api/python/auto_scheduler.html      |    4 +-
 .../api/typedoc/classes/bytestreamreader.html      |   12 +-
 .../api/typedoc/classes/cachedcallstack.html       |   34 +-
 docs/reference/api/typedoc/classes/dldatatype.html |   12 +-
 docs/reference/api/typedoc/classes/dldevice.html   |   10 +-
 .../reference/api/typedoc/classes/environment.html |   12 +-
 docs/reference/api/typedoc/classes/ffilibrary.html |   20 +-
 .../api/typedoc/classes/graphexecutor.html         |   16 +-
 docs/reference/api/typedoc/classes/instance.html   |   40 +-
 docs/reference/api/typedoc/classes/memory.html     |   34 +-
 docs/reference/api/typedoc/classes/module.html     |   10 +-
 docs/reference/api/typedoc/classes/ndarray.html    |   22 +-
 .../api/typedoc/classes/packedfunccell.html        |    6 +-
 docs/reference/api/typedoc/classes/rpcserver.html  |   14 +-
 docs/reference/api/typedoc/classes/scalar.html     |    6 +-
 .../api/typedoc/classes/webgpucontext.html         |   12 +-
 docs/reference/api/typedoc/enums/argtypecode.html  |   30 +-
 .../api/typedoc/enums/aynccallbackcode.html        |    4 +-
 .../api/typedoc/enums/dldatatypecode.html          |    8 +-
 .../api/typedoc/enums/rpcserverstate.html          |   12 +-
 docs/reference/api/typedoc/enums/sizeof.html       |   18 +-
 docs/reference/api/typedoc/index.html              |  112 +-
 .../api/typedoc/interfaces/disposable.html         |    2 +-
 .../api/typedoc/interfaces/functioninfo.html       |    6 +-
 .../api/typedoc/interfaces/libraryprovider.html    |    4 +-
 docs/searchindex.js                                |    2 +-
 .../vta/tutorials/autotvm/sg_execution_times.html  |    4 +-
 .../tutorials/frontend/deploy_classification.html  |    2 +-
 .../vta/tutorials/frontend/deploy_detection.html   |    2 +-
 .../vta/tutorials/frontend/sg_execution_times.html |    6 +-
 .../vta/tutorials/optimize/sg_execution_times.html |    6 +-
 docs/topic/vta/tutorials/sg_execution_times.html   |    6 +-
 docs/tutorial/auto_scheduler_matmul_x86.html       |    6 +-
 docs/tutorial/autotvm_matmul_x86.html              |   23 +-
 docs/tutorial/autotvm_relay_x86.html               |  261 ++--
 docs/tutorial/cross_compilation_and_rpc.html       |    3 +
 docs/tutorial/install.html                         |    3 +
 docs/tutorial/intro_topi.html                      |    2 +-
 docs/tutorial/introduction.html                    |    3 +
 docs/tutorial/relay_quick_start.html               |    3 +
 docs/tutorial/sg_execution_times.html              |   40 +-
 docs/tutorial/tensor_expr_get_started.html         |   43 +-
 docs/tutorial/tvmc_command_line_driver.html        |    3 +
 docs/tutorial/tvmc_python.html                     |    3 +
 267 files changed, 3521 insertions(+), 4316 deletions(-)

diff --git a/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py b/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py
index b2af2e13f..459b2798c 100644
--- a/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py
+++ b/docs/_downloads/0387f07dee851b2b8c6b73e3e88c3140/tune_relay_cuda.py
@@ -39,6 +39,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # Install dependencies
 # --------------------
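
The hunk above is the change this commit repeats across nearly every tutorial script: a block fenced by # sphinx_gallery_start_ignore / # sphinx_gallery_end_ignore that installs tvm.testing's request hook while Sphinx-Gallery executes the script to render the docs. Code between those markers still runs during the build but is stripped from the rendered page. As a rough sketch of the general pattern only (this is not TVM's implementation, and the names below are invented), a request hook of this kind typically wraps urllib so downloads issued by tutorial code can be observed or redirected to a build-side cache:

    import urllib.request

    _original_urlopen = urllib.request.urlopen

    def _hooked_urlopen(url, *args, **kwargs):
        # Log every outbound fetch a tutorial makes; a real docs-build
        # hook might instead rewrite the URL to hit a local mirror.
        print("[request-hook] fetching", url)
        return _original_urlopen(url, *args, **kwargs)

    urllib.request.urlopen = _hooked_urlopen

What the depth=3 argument controls is not determinable from this diff alone.
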
diff --git a/docs/_downloads/067cf39a44d9f315a39f8a7547c556d8/install.py b/docs/_downloads/067cf39a44d9f315a39f8a7547c556d8/install.py
index 0eb3ccc94..a499b0379 100644
--- a/docs/_downloads/067cf39a44d9f315a39f8a7547c556d8/install.py
+++ b/docs/_downloads/067cf39a44d9f315a39f8a7547c556d8/install.py
@@ -28,6 +28,12 @@ methods for installing TVM. These include:
 * Installing from third-party binary package.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # Installing From Source
 # ----------------------
diff --git a/docs/_downloads/07733b6b2cc4df026fce525285e8f538/tune_sparse_x86.py b/docs/_downloads/07733b6b2cc4df026fce525285e8f538/tune_sparse_x86.py
index 55ee76ef6..0a2ddbd1b 100644
--- a/docs/_downloads/07733b6b2cc4df026fce525285e8f538/tune_sparse_x86.py
+++ b/docs/_downloads/07733b6b2cc4df026fce525285e8f538/tune_sparse_x86.py
@@ -35,6 +35,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import os
 
 import numpy as np
diff --git a/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb b/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
index d47acc203..bae6ce242 100644
--- a/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
+++ b/docs/_downloads/0b60295044fd20226a0d5adc52b50b2f/deploy_sparse.ipynb
@@ -18,6 +18,17 @@
         "\n# Deploy a Hugging Face Pruned Model on CPU\n**Author**: [Josh Fromm](https://github.com/jwfromm)\n\nThis tutorial demonstrates how to take any pruned model, in this case [PruneBert\nfrom Hugging Face](https://huggingface.co/huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad),\nand use TVM to leverage the model's sparsity support to produce real speedups. Although\nthe primary purpose of this tutorial is to realize speedups on already pruned\nmodels, it may also be [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py b/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py
index 027e9e6eb..380846186 100644
--- a/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py
+++ b/docs/_downloads/0e2f38fcb1a1fb3e636e5953aa600dee/from_mxnet.py
@@ -35,6 +35,12 @@ A quick solution is
 or please refer to official installation guide.
 https://mxnet.apache.org/versions/master/install/index.html
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 # some standard imports
 import mxnet as mx
 import tvm
diff --git a/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb b/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
index f76e95a75..19c4dc5b8 100644
--- a/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
+++ b/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb
@@ -18,6 +18,17 @@
         "\n\n# How to optimize GEMM on CPU\n**Author**: [Jian Weng](https://github.com/were),             [Ruofei Yu](https://github.com/yuruofeifei)\n\n(TL;DR) TVM provides abstract interfaces which allows users to depict an algorithm and the\nalgorithm's implementing organization (the so-called schedule) separately. Typically, writing\nalgorithm in high-performance schedule breaks the algorithm's readability and modularity. Also,\ntrying various seemingly promising schedules is time-co [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/10724e9ad9c29faa223c1d5eab6dbef9/tvmc_python.py b/docs/_downloads/10724e9ad9c29faa223c1d5eab6dbef9/tvmc_python.py
index 6efc565f0..28b0a9745 100644
--- a/docs/_downloads/10724e9ad9c29faa223c1d5eab6dbef9/tvmc_python.py
+++ b/docs/_downloads/10724e9ad9c29faa223c1d5eab6dbef9/tvmc_python.py
@@ -36,6 +36,12 @@ Follow the steps to download a resnet model via the terminal:
 Let's start editing the python file in your favorite text editor.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # Step 0: Imports
 # ~~~~~~~~~~~~~~~
diff --git a/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb b/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb
index e3219ff64..f9cefac27 100644
--- a/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb
+++ b/docs/_downloads/10d831d158490a9ee3abd1901806fc11/reduction.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\n\nimport tvm\nimport tvm.testing\nfrom tvm import te\nimport numpy as np"
+        "from __future__ import absolute_import, print_function\n\n\nimport tvm\nimport tvm.testing\nfrom tvm import te\nimport numpy as np"
       ]
     },
     {
diff --git a/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py b/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py
index 9d67cbcdf..fecb1c48d 100644
--- a/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py
+++ b/docs/_downloads/16269b77359771348d507395692524cf/from_paddle.py
@@ -30,6 +30,12 @@ A quick solution is
 or please refer to official site.
 https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tarfile
 import paddle
 import numpy as np
diff --git a/docs/_downloads/17b139d609f9480c7eeda2da1f90f28c/tune_network_arm.py b/docs/_downloads/17b139d609f9480c7eeda2da1f90f28c/tune_network_arm.py
index 9c5820c99..09a1d0cea 100644
--- a/docs/_downloads/17b139d609f9480c7eeda2da1f90f28c/tune_network_arm.py
+++ b/docs/_downloads/17b139d609f9480c7eeda2da1f90f28c/tune_network_arm.py
@@ -46,6 +46,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import numpy as np
 import os
 
diff --git a/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb b/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
index ab0c80fba..f7fc826ff 100644
--- a/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
+++ b/docs/_downloads/1a26d790f7b98309d730181290dae3ee/deploy_prequantized_tflite.ipynb
@@ -18,6 +18,17 @@
         "\n# Deploy a Framework-prequantized Model with TVM - Part 3 (TFLite)\n**Author**: [Siju Samuel](https://github.com/siju-samuel)\n\nWelcome to part 3 of the Deploy Framework-Prequantized Model with TVM tutorial.\nIn this part, we will start with a Quantized TFLite graph and then compile and execute it via TVM.\n\n\nFor more details on quantizing the model using TFLite, readers are encouraged to\ngo through [Converting Quantized Models](https://www.tensorflow.org/lite/convert/quan [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb b/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb
index 377c7cf58..553aef0b2 100644
--- a/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb
+++ b/docs/_downloads/1e482ba1190961191e3a0bdbd0585faa/intrin_math.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\nimport numpy as np\n\nimport tvm\nfrom tvm import te\nfrom tvm.ir import register_op_attr, register_intrin_lowering"
+        "from __future__ import absolute_import, print_function\n\n\nimport numpy as np\n\nimport tvm\nfrom tvm import te\nfrom tvm.ir import register_op_attr, register_intrin_lowering"
       ]
     },
     {
diff --git a/docs/_downloads/21a9dd883b196be58ca1c5cd02700274/deploy_model_on_android.py b/docs/_downloads/21a9dd883b196be58ca1c5cd02700274/deploy_model_on_android.py
index c7b610d5d..10e108239 100644
--- a/docs/_downloads/21a9dd883b196be58ca1c5cd02700274/deploy_model_on_android.py
+++ b/docs/_downloads/21a9dd883b196be58ca1c5cd02700274/deploy_model_on_android.py
@@ -25,6 +25,12 @@ Deploy the Pretrained Model on Android
 This is an example of using Relay to compile a keras model and deploy it on Android device.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import os
 import numpy as np
 from PIL import Image
diff --git a/docs/_downloads/233ceda3a682ae5df93b4ce0bcfbf870/tvmc_command_line_driver.py b/docs/_downloads/233ceda3a682ae5df93b4ce0bcfbf870/tvmc_command_line_driver.py
index 48e3703be..ad5b37190 100644
--- a/docs/_downloads/233ceda3a682ae5df93b4ce0bcfbf870/tvmc_command_line_driver.py
+++ b/docs/_downloads/233ceda3a682ae5df93b4ce0bcfbf870/tvmc_command_line_driver.py
@@ -41,6 +41,12 @@ The goal of this section is to give you an overview of TVM and TVMC's
 capabilities, and set the stage for understanding how TVM works.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # Using TVMC
 # ----------
diff --git a/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb b/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
index 940533d06..822313022 100644
--- a/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
+++ b/docs/_downloads/23968bb778cd9591b7ad858bf17dcc3e/from_tflite.ipynb
@@ -18,6 +18,17 @@
         "\n# Compile TFLite Models\n**Author**: [Zhao Wu](https://github.com/FrozenGene)\n\nThis article is an introductory tutorial to deploy TFLite models with Relay.\n\nTo get started, TFLite package needs to be installed as prerequisite.\n\n```bash\n# install tflite\npip install tflite==2.1.0 --user\n```\nor you could generate TFLite package yourself. The steps are the following:\n\n```bash\n# Get the flatc compiler.\n# Please refer to https://github.com/google/flatbuffers for detail [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/286e7f77f494a25312ac88e3f234822e/extern_op.py b/docs/_downloads/286e7f77f494a25312ac88e3f234822e/extern_op.py
index a0aa5d724..ad741a08d 100644
--- a/docs/_downloads/286e7f77f494a25312ac88e3f234822e/extern_op.py
+++ b/docs/_downloads/286e7f77f494a25312ac88e3f234822e/extern_op.py
@@ -31,6 +31,12 @@ or pointer to DLTensor as argument.
 """
 from __future__ import absolute_import, print_function
 
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import numpy as np
diff --git a/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py b/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py
index 164f36daf..432e9cd14 100644
--- a/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py
+++ b/docs/_downloads/2a0982f8ca0176cb17713d28286536e4/reduction.py
@@ -27,6 +27,12 @@ In this tutorial, we will demonstrate how to do reduction in TVM.
 """
 from __future__ import absolute_import, print_function
 
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 import tvm.testing
 from tvm import te
diff --git a/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb b/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb
index 86f13e985..f025d8d1b 100644
--- a/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb
+++ b/docs/_downloads/2a4c6a9cfa43e8afef159a2bf1b99108/install.ipynb
@@ -18,6 +18,17 @@
         "\n# Installing TVM\n**Authors**:\n[Jocelyn Shiue](https://github.com/),\n[Chris Hoge](https://github.com/hogepodge)\n\nDepending on your needs and your working environment, there are a few different\nmethods for installing TVM. These include:\n\n* Installing from source\n* Installing from third-party binary package.\n"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb b/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb
index afcd16f40..1416cc838 100644
--- a/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb
+++ b/docs/_downloads/2f91b1346a0ba21b800081aa15fdaac2/autotvm_relay_x86.ipynb
@@ -18,6 +18,17 @@
         "\n# Compiling and Optimizing a Model with the Python Interface (AutoTVM)\n**Author**:\n[Chris Hoge](https://github.com/hogepodge)\n\nIn the [TVMC Tutorial](tvmc_command_line_driver), we covered how to compile, run, and tune a\npre-trained vision model, ResNet-50 v2 using the command line interface for\nTVM, TVMC. TVM is more that just a command-line tool though, it is an\noptimizing framework with APIs available for a number of different languages\nthat gives you tremendous flex [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py b/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py
index 3d871ba78..dfe33eeda 100644
--- a/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py
+++ b/docs/_downloads/2fb9ae7bf124f72614a43137cf2919cb/micro_tflite.py
@@ -25,6 +25,12 @@ This tutorial is an introduction to working with microTVM and a TFLite
 model with Relay.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # .. note::
 #     If you want to run this tutorial on the microTVM Reference VM, download the Jupyter
diff --git a/docs/_downloads/31d82e25454740f5ba711497485c0dd4/introduction.py b/docs/_downloads/31d82e25454740f5ba711497485c0dd4/introduction.py
index 5fe4b4e5f..908a8e52c 100644
--- a/docs/_downloads/31d82e25454740f5ba711497485c0dd4/introduction.py
+++ b/docs/_downloads/31d82e25454740f5ba711497485c0dd4/introduction.py
@@ -45,6 +45,12 @@ Contents
 #. :doc:`Compiling Deep Learning Models for GPUs <relay_quick_start>`
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # An Overview of TVM and Model Optimization
 # =========================================
diff --git a/docs/_downloads/35eacf8f75629e07aeda1329bdb7d53c/tune_relay_arm.py b/docs/_downloads/35eacf8f75629e07aeda1329bdb7d53c/tune_relay_arm.py
index f072c5dda..ab278021d 100644
--- a/docs/_downloads/35eacf8f75629e07aeda1329bdb7d53c/tune_relay_arm.py
+++ b/docs/_downloads/35eacf8f75629e07aeda1329bdb7d53c/tune_relay_arm.py
@@ -41,6 +41,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # Install dependencies
 # --------------------
diff --git a/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb b/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
index 37040c308..e4fd81d2c 100644
--- a/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
+++ b/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
@@ -18,6 +18,17 @@
         "\n\n# Optimizing Operators with Schedule Templates and AutoTVM\n**Authors**:\n[Lianmin Zheng](https://github.com/merrymercy),\n[Chris Hoge](https://github.com/hogepodge)\n\nIn this tutorial, we show how the TVM Tensor Expression (TE) language\ncan be used to write schedule templates that can be searched by AutoTVM to\nfind the optimal schedule. This process is called Auto-Tuning, which helps\nautomate the process of optimizing tensor computation.\n\nThis tutorial builds on the p [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py b/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py
index 17fa3ff37..e10a74c84 100644
--- a/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py
+++ b/docs/_downloads/3a9b1d387f618487c8ccf6b8b78ae179/intro_topi.py
@@ -26,6 +26,12 @@ TOPI provides numpy-style generic operations and schedules with higher abstracti
 In this tutorial, we will see how TOPI can save us from writing boilerplate code in TVM.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import tvm
 import tvm.testing
 from tvm import te
diff --git a/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py b/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py
index 98d1969f3..96d296794 100644
--- a/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py
+++ b/docs/_downloads/3aeab7c9d659bf5da70126a1aff7c403/from_coreml.py
@@ -34,6 +34,12 @@ A quick solution is to install via pip
 or please refer to official site
 https://github.com/apple/coremltools
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import tvm.relay as relay
diff --git a/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb b/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb
index 72f3a8a3a..c3e2eb8c2 100644
--- a/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb
+++ b/docs/_downloads/3b5e41b16a898b72d18127ebe2182c66/tensorize.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\n\nimport tvm\nfrom tvm import te\nimport tvm.testing\nimport numpy as np"
+        "from __future__ import absolute_import, print_function\n\n\nimport tvm\nfrom tvm import te\nimport tvm.testing\nimport numpy as np"
       ]
     },
     {
diff --git a/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py b/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py
index 3d2caa0d3..e5b452af6 100644
--- a/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py
+++ b/docs/_downloads/3c5c85c3954f3110f16ca084e286f03a/opt_conv_cuda.py
@@ -30,6 +30,12 @@ channel, batch.
 
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################
 # Preparation and Algorithm
 # -------------------------
diff --git a/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb b/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb
index 5329ce098..c679065a7 100644
--- a/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb
+++ b/docs/_downloads/3dd2108354ac3028c96bcd6a0c7899dd/relay_quick_start.ipynb
@@ -18,6 +18,17 @@
         "\n\n# Quick Start Tutorial for Compiling Deep Learning Models\n**Author**: [Yao Wang](https://github.com/kevinthesun), [Truman Tian](https://github.com/SiNZeRo)\n\nThis example shows how to build a neural network with Relay python frontend and\ngenerates a runtime library for Nvidia GPU with TVM.\nNotice that you need to build TVM with cuda and llvm enabled.\n"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/408ef96692a668b33d94eca33cac7e0a/deploy_model_on_rasp.py b/docs/_downloads/408ef96692a668b33d94eca33cac7e0a/deploy_model_on_rasp.py
index de4ed9aff..ab5374d93 100644
--- a/docs/_downloads/408ef96692a668b33d94eca33cac7e0a/deploy_model_on_rasp.py
+++ b/docs/_downloads/408ef96692a668b33d94eca33cac7e0a/deploy_model_on_rasp.py
@@ -26,6 +26,12 @@ This is an example of using Relay to compile a ResNet model and deploy
 it on Raspberry Pi.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import tvm
 from tvm import te
 import tvm.relay as relay
diff --git a/docs/_downloads/40a01cffb015a67aaec0fad7e27cf80d/tensor_expr_get_started.py b/docs/_downloads/40a01cffb015a67aaec0fad7e27cf80d/tensor_expr_get_started.py
index 25ea4e8a5..11186d2f1 100644
--- a/docs/_downloads/40a01cffb015a67aaec0fad7e27cf80d/tensor_expr_get_started.py
+++ b/docs/_downloads/40a01cffb015a67aaec0fad7e27cf80d/tensor_expr_get_started.py
@@ -39,6 +39,12 @@ serve as the comparative basis for future tutorials covering more advanced
 features of TVM.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # Example 1: Writing and Scheduling Vector Addition in TE for CPU
 # ---------------------------------------------------------------
diff --git a/docs/_downloads/428c6201e29ce74e73c6b41eee589f62/tensorize.py b/docs/_downloads/428c6201e29ce74e73c6b41eee589f62/tensorize.py
index 40e68074a..45eaf349f 100644
--- a/docs/_downloads/428c6201e29ce74e73c6b41eee589f62/tensorize.py
+++ b/docs/_downloads/428c6201e29ce74e73c6b41eee589f62/tensorize.py
@@ -34,6 +34,12 @@ and usage of tensorize instead of providing an efficient solution.
 """
 from __future__ import absolute_import, print_function
 
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import tvm.testing
diff --git a/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb b/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb
index 3c861b2b0..9abc7a929 100644
--- a/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb
+++ b/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb
@@ -18,6 +18,17 @@
         "\n\n# Working with Operators Using Tensor Expression\n**Author**: [Tianqi Chen](https://tqchen.github.io)\n\nIn this tutorial we will turn our attention to how TVM works with Tensor\nExpression (TE) to define tensor computations and apply loop optimizations. TE\ndescribes tensor computations in a pure functional language (that is each\nexpression has no side effects). When viewed in context of the TVM as a whole,\nRelay describes a computation as a set of operators, and each of  [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/5499fb7d70e17c6aabf49246a978db52/use_pass_infra.py b/docs/_downloads/5499fb7d70e17c6aabf49246a978db52/use_pass_infra.py
index e38383e69..a41a26fc0 100644
--- a/docs/_downloads/5499fb7d70e17c6aabf49246a978db52/use_pass_infra.py
+++ b/docs/_downloads/5499fb7d70e17c6aabf49246a978db52/use_pass_infra.py
@@ -40,6 +40,12 @@ a certain optimization and create an optimization pipeline for a Relay program.
 The same approach can be used for tir as well.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import numpy as np
 import tvm
 from tvm import te
diff --git a/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb b/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
index cde854d21..a757c245f 100644
--- a/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
+++ b/docs/_downloads/55a9eff88b1303e525d53269eeb16897/micro_ethosu.ipynb
@@ -18,6 +18,17 @@
         "\n# Running TVM on bare metal Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN\n**Author**:\n[Grant Watson](https://github.com/grant-arm)\n\nThis section contains an example of how to use TVM to run a model\non an Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN, using bare metal.\nThe Cortex(R)-M55 is a small, low-power CPU designed for use in embedded\ndevices. CMSIS-NN is a collection of kernels optimized for Arm(R) Cortex(R)-M CPUs.\nThe Ethos(TM)-U55 [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py b/docs/_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py
index 830e2ab07..494b4a9e2 100644
--- a/docs/_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py
+++ b/docs/_downloads/56691c7a27d45da61d112276334640d3/deploy_prequantized_tflite.py
@@ -42,6 +42,12 @@ Now please check if TFLite package is installed successfully, ``python -c "impor
 
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ###############################################################################
 # Necessary imports
 # -----------------
diff --git a/docs/_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py b/docs/_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py
index 4e5714a6d..b7dfbe28f 100644
--- a/docs/_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py
+++ b/docs/_downloads/57a45d9bef1af358191e7d50043e652c/autotvm_relay_x86.py
@@ -42,6 +42,12 @@ The goal of this section is to give you an overview of TVM's capabilites and
 how to use them through the Python API.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # TVM is a deep learning compiler framework, with a number of different modules
 # available for working with deep learning models and operators. In this
diff --git a/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb b/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
index ddb727d98..a55654cfc 100644
--- a/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
+++ b/docs/_downloads/5b279d8a8718816263fa65b0eef1a5c0/micro_tflite.ipynb
@@ -18,6 +18,17 @@
         "\n\n# microTVM with TFLite Models\n**Author**: [Tom Gall](https://github.com/tom-gall)\n\nThis tutorial is an introduction to working with microTVM and a TFLite\nmodel with Relay.\n"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py b/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py
index 11edc7ae9..a62fa3979 100644
--- a/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py
+++ b/docs/_downloads/5c7000b5aef924e29ec975ec3002ea03/tensor_ir_blitz_course.py
@@ -29,6 +29,12 @@ TensorIR is a domain specific language for deep learning programs serving two br
 
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import tvm
 from tvm.ir.module import IRModule
 from tvm.script import tir as T
diff --git a/docs/_downloads/644d28fc67dfb3099fb0d275ffcf1c7c/tune_relay_mobile_gpu.py b/docs/_downloads/644d28fc67dfb3099fb0d275ffcf1c7c/tune_relay_mobile_gpu.py
index d3f4ec62f..5a4f0c56d 100644
--- a/docs/_downloads/644d28fc67dfb3099fb0d275ffcf1c7c/tune_relay_mobile_gpu.py
+++ b/docs/_downloads/644d28fc67dfb3099fb0d275ffcf1c7c/tune_relay_mobile_gpu.py
@@ -39,6 +39,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # Install dependencies
 # --------------------
diff --git a/docs/_downloads/67bf7dd99bcfb837cf3e8b461a5eeb48/tune_network_mali.py b/docs/_downloads/67bf7dd99bcfb837cf3e8b461a5eeb48/tune_network_mali.py
index 2d1e51520..8ac0b235d 100644
--- a/docs/_downloads/67bf7dd99bcfb837cf3e8b461a5eeb48/tune_network_mali.py
+++ b/docs/_downloads/67bf7dd99bcfb837cf3e8b461a5eeb48/tune_network_mali.py
@@ -44,6 +44,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import numpy as np
 
 import tvm
diff --git a/docs/_downloads/6836ce26807b8d33b8f499287c1f3d04/tune_relay_x86.py b/docs/_downloads/6836ce26807b8d33b8f499287c1f3d04/tune_relay_x86.py
index 771220bb3..6e46fbd8f 100644
--- a/docs/_downloads/6836ce26807b8d33b8f499287c1f3d04/tune_relay_x86.py
+++ b/docs/_downloads/6836ce26807b8d33b8f499287c1f3d04/tune_relay_x86.py
@@ -28,6 +28,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import os
 import numpy as np
 
diff --git a/docs/_downloads/68abf665197871646fffcd0955bddad7/tuple_inputs.py b/docs/_downloads/68abf665197871646fffcd0955bddad7/tuple_inputs.py
index 73db7b90a..86ec8b2d1 100644
--- a/docs/_downloads/68abf665197871646fffcd0955bddad7/tuple_inputs.py
+++ b/docs/_downloads/68abf665197871646fffcd0955bddad7/tuple_inputs.py
@@ -27,6 +27,12 @@ In this tutorial, we will introduce the usage of tuple inputs in TVM.
 """
 from __future__ import absolute_import, print_function
 
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import numpy as np
diff --git a/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py b/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py
index e3072773b..95d6dcb0a 100644
--- a/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py
+++ b/docs/_downloads/6ad550da5092845382b1197f58a93816/tune_conv2d_cuda.py
@@ -28,6 +28,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # Install dependencies
 # --------------------
diff --git a/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb b/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
index 636c4b003..4bdcaefd1 100644
--- a/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
+++ b/docs/_downloads/6b0f549107f73f2e48c894372be08bcb/tune_relay_mobile_gpu.ipynb
@@ -18,6 +18,17 @@
         "\n# Auto-tuning a Convolutional Network for Mobile GPU\n**Author**: [Lianmin Zheng](https://github.com/merrymercy), [Eddie Yan](https://github.com/eqy)\n\nAuto-tuning for a specific device is critical for getting the best\nperformance. This is a tutorial about how to tune a whole convolutional\nnetwork.\n\nThe operator implementation for Mobile GPU in TVM is written in template form.\nThe template has many tunable knobs (tile factor, vectorization, unrolling, etc).\nWe will tune [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb b/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb
index d51058c8a..dfb6cccea 100644
--- a/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb
+++ b/docs/_downloads/729378592a96230b4f7be71b44da43a4/scan.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\n\nimport tvm\nimport tvm.testing\nfrom tvm import te\nimport numpy as np"
+        "from __future__ import absolute_import, print_function\n\n\nimport tvm\nimport tvm.testing\nfrom tvm import te\nimport numpy as np"
       ]
     },
     {
diff --git a/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb b/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
index d5c2590b1..ef3065f7a 100644
--- a/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
+++ b/docs/_downloads/732ed130cbc15432e737da8cc47e1734/tune_conv2d_cuda.ipynb
@@ -18,6 +18,17 @@
         "\n# Tuning High Performance Convolution on NVIDIA GPUs\n**Author**: [Lianmin Zheng](https://github.com/merrymercy)\n\nThis is an advanced tutorial for writing high performance tunable template for\nNVIDIA GPU. By running auto-tuner on this template, we can outperform the\nvendor provided library CuDNN in many cases.\n\nNote that this tutorial will not run on Windows or recent versions of macOS. To\nget it to run, you will need to wrap the body of this tutorial in a :code:`if\n__ [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py b/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py
index ccfc7b974..4cc2b40b7 100644
--- a/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py
+++ b/docs/_downloads/7372db5919b5619bc34fde3434862bca/opt_conv_tensorcore.py
@@ -27,6 +27,12 @@ convolution has a large batch. We strongly recommend covering the :ref:`opt-conv
 
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################
 # TensorCore Introduction
 # -----------------------
diff --git a/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb b/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
index 30326660e..cbf2755b3 100644
--- a/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
+++ b/docs/_downloads/7455981870c23c8c76482dedf33d8a42/opt_conv_tensorcore.ipynb
@@ -18,6 +18,17 @@
         "\n\n# How to optimize convolution using TensorCores\n**Author**: [Siyuan Feng](https://github.com/Hzfengsy)\n\nIn this tutorial, we will demonstrate how to write a high performance convolution\nschedule using TensorCores in TVM. In this example, we assume the input to\nconvolution has a large batch. We strongly recommend covering the `opt-conv-gpu` tutorial first.\n"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/766206ab8f1fd80ac34d9816cb991a0d/cross_compilation_and_rpc.py b/docs/_downloads/766206ab8f1fd80ac34d9816cb991a0d/cross_compilation_and_rpc.py
index 25208369f..3f74899f7 100644
--- a/docs/_downloads/766206ab8f1fd80ac34d9816cb991a0d/cross_compilation_and_rpc.py
+++ b/docs/_downloads/766206ab8f1fd80ac34d9816cb991a0d/cross_compilation_and_rpc.py
@@ -31,6 +31,12 @@ platforms. In this tutorial, we will use the Raspberry Pi for a CPU example
 and the Firefly-RK3399 for an OpenCL example.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # Build TVM Runtime on Device
 # ---------------------------
diff --git a/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py b/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py
index 232058641..c12a9e7e1 100644
--- a/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py
+++ b/docs/_downloads/7716f96385bd5abb6e822041e285be54/from_darknet.py
@@ -31,6 +31,12 @@ Please install CFFI and CV2 before executing this script
   pip install opencv-python
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 # numpy and matplotlib
 import numpy as np
 import matplotlib.pyplot as plt
diff --git a/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py b/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py
index b5b0e4acf..0d8d0f286 100644
--- a/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py
+++ b/docs/_downloads/7795da4b258c8feff986668b95ef57ad/deploy_object_detection_pytorch.py
@@ -40,6 +40,12 @@ Currently, TVM supports PyTorch 1.7 and 1.4. Other versions may
 be unstable.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import tvm
 from tvm import relay
diff --git a/docs/_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py b/docs/_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py
index 2d9275796..24c7ce333 100644
--- a/docs/_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py
+++ b/docs/_downloads/7810ecf51bfc05f7d5e8a400ac3e815d/deploy_quantized.py
@@ -27,6 +27,12 @@ In this tutorial, we will import a GluonCV pre-trained model on ImageNet to
 Relay, quantize the Relay model and then perform the inference.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import tvm
 from tvm import te
 from tvm import relay
diff --git a/docs/_downloads/79027b28c061178b7ea56e3f047eeef1/micro_reference_vm.py b/docs/_downloads/79027b28c061178b7ea56e3f047eeef1/micro_reference_vm.py
index 9eacd9a96..b87a72656 100644
--- a/docs/_downloads/79027b28c061178b7ea56e3f047eeef1/micro_reference_vm.py
+++ b/docs/_downloads/79027b28c061178b7ea56e3f047eeef1/micro_reference_vm.py
@@ -157,3 +157,9 @@ local QEMU emulator running within the VM, run the following commands instead:
 
 
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
diff --git a/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py b/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py
index 4563e245c..9a3239781 100644
--- a/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py
+++ b/docs/_downloads/7f1d3d1b878694c201c614c807cdebc8/from_tensorflow.py
@@ -24,6 +24,12 @@ For us to begin with, tensorflow python module is required to be installed.
 Please refer to https://www.tensorflow.org/install
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 # tvm, relay
 import tvm
 from tvm import te
diff --git a/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb b/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
index b034c7af7..af9ebca2d 100644
--- a/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
+++ b/docs/_downloads/825671e45a9bdc4733400384984cd9dd/build_gcn.ipynb
@@ -69,7 +69,7 @@
       },
       "outputs": [],
       "source": [
-        "\"\"\"\nParameters\n----------\ndataset: str\n    Name of dataset. You can choose from ['cora', 'citeseer', 'pubmed'].\n\nnum_layer: int\n    number of hidden layers\n\nnum_hidden: int\n    number of the hidden units in the hidden layer\n\ninfeat_dim: int\n    dimension of the input features\n\nnum_classes: int\n    dimension of model output (Number of classes)\n\"\"\"\ndataset = \"cora\"\ng, data = load_dataset(dataset)\n\nnum_layers = 1\nnum_hidden = 16\ninfeat_dim = data.feat [...]
+        "\"\"\"\nParameters\n----------\ndataset: str\n    Name of dataset. You can choose from ['cora', 'citeseer', 'pubmed'].\n\nnum_layer: int\n    number of hidden layers\n\nnum_hidden: int\n    number of the hidden units in the hidden layer\n\ninfeat_dim: int\n    dimension of the input features\n\nnum_classes: int\n    dimension of model output (Number of classes)\n\"\"\"\n\ndataset = \"cora\"\ng, data = load_dataset(dataset)\n\nnum_layers = 1\nnum_hidden = 16\ninfeat_dim = data.fe [...]
       ]
     },
     {
diff --git a/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb b/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb
index 0b45bf898..633027b3a 100644
--- a/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb
+++ b/docs/_downloads/8472bea81cf679760d7e4e77e895726f/extern_op.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\n\nimport tvm\nfrom tvm import te\nimport numpy as np\nfrom tvm.contrib import cblas\nimport tvm.testing\n\nif not tvm.get_global_func(\"tvm.contrib.cblas.matmul\", allow_missing=True):\n    raise Exception(\"Not compiled with cblas support; can't build this tutorial\")"
+        "from __future__ import absolute_import, print_function\n\n\nimport tvm\nfrom tvm import te\nimport numpy as np\nfrom tvm.contrib import cblas\nimport tvm.testing\n\nif not tvm.get_global_func(\"tvm.contrib.cblas.matmul\", allow_missing=True):\n    raise Exception(\"Not compiled with cblas support; can't build this tutorial\")"
       ]
     },
     {
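
The cell above also shows a guard worth noting: probing TVM's global function registry before depending on an optional native library. The same check, as a standalone sketch:

.. code-block:: python

    import tvm

    # allow_missing=True makes the lookup return None instead of raising
    # when TVM was built without cblas, so we can fail with a clear message.
    if not tvm.get_global_func("tvm.contrib.cblas.matmul", allow_missing=True):
        raise Exception("Not compiled with cblas support; can't build this tutorial")
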
diff --git a/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb b/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
index a7ae407a9..d1f99f4d9 100644
--- a/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
+++ b/docs/_downloads/854257a66df713b1f3f82eb3577f95e3/opt_conv_cuda.ipynb
@@ -18,6 +18,17 @@
         "\n\n# How to optimize convolution on GPU\n**Author**: [Haichen Shen](https://homes.cs.washington.edu/~haichen/)\n\nIn this tutorial, we will demonstrate how to write a high performance\nconvolution implementation in TVM. We use square size input tensors and filters\nas an example, and assume the input to convolution has a large batch. In this\nexample, we use a different layout to store the data in order to achieve better\ndata locality. The buffer layout is HWCN, which stands f [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py b/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py
index 3f3d7e91e..d21673acd 100644
--- a/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py
+++ b/docs/_downloads/8c7d8fd6a4b93bcff1f5573943dd02f4/scan.py
@@ -24,6 +24,12 @@ Recurrent computing is a typical pattern in neural networks.
 """
 from __future__ import absolute_import, print_function
 
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 import tvm.testing
 from tvm import te
diff --git a/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb b/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb
index 3b4509abc..fbf1f374c 100644
--- a/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb
+++ b/docs/_downloads/8d55b8f991fb704002f768367ce2d1a2/tvmc_python.ipynb
@@ -18,6 +18,17 @@
         "\n# Getting Starting using TVMC Python: a high-level API for TVM\n**Author**:\n[Jocelyn Shiue](https://github.com/CircleSpin)\n\nHi! Here we explain the scripting tool designed for the complete TVM beginner. \ud83d\ude42                                                                                                      \n\nBefore we get started let's get an example model if you don't already have one.\nFollow the steps to download a resnet model via the terminal:\n\n```python\n [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/8e7bbc9dbdda76ac573b24606b41c006/autotvm_matmul_x86.py b/docs/_downloads/8e7bbc9dbdda76ac573b24606b41c006/autotvm_matmul_x86.py
index b84a6193c..ebdbacb22 100644
--- a/docs/_downloads/8e7bbc9dbdda76ac573b24606b41c006/autotvm_matmul_x86.py
+++ b/docs/_downloads/8e7bbc9dbdda76ac573b24606b41c006/autotvm_matmul_x86.py
@@ -45,6 +45,12 @@ workflow is illustrated by a matrix multiplication example.
   :code:`if __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # Install dependencies
 # --------------------
diff --git a/docs/_downloads/96137df89d8034b548f407123ec50ce9/opt_gemm.py b/docs/_downloads/96137df89d8034b548f407123ec50ce9/opt_gemm.py
index 920d7a87f..d2ec711c2 100644
--- a/docs/_downloads/96137df89d8034b548f407123ec50ce9/opt_gemm.py
+++ b/docs/_downloads/96137df89d8034b548f407123ec50ce9/opt_gemm.py
@@ -48,6 +48,12 @@ All the experiment results mentioned below, are executed on 2015's 15' MacBook e
 Intel i7-4770HQ CPU. The cache line size should be 64 bytes for all the x86 CPUs.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################################
 # Preparation and Baseline
 # ------------------------
diff --git a/docs/_downloads/9c3764c88ab3eb57dc223b4eda1e8a2f/deploy_sparse.py b/docs/_downloads/9c3764c88ab3eb57dc223b4eda1e8a2f/deploy_sparse.py
index 56a5f1aaf..b9a26e0d3 100644
--- a/docs/_downloads/9c3764c88ab3eb57dc223b4eda1e8a2f/deploy_sparse.py
+++ b/docs/_downloads/9c3764c88ab3eb57dc223b4eda1e8a2f/deploy_sparse.py
@@ -70,6 +70,12 @@ sparsity. A fun exercise is comparing the real speed of PruneBert with the block
 sparse speed using fake weights to see the benefit of structured sparsity.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ###############################################################################
 # Load Required Modules
 # ---------------------
diff --git a/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py b/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py
index 613d92e14..58c52508b 100644
--- a/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py
+++ b/docs/_downloads/9ccca8fd489a1486ac71b55a55c320c5/micro_autotune.py
@@ -27,6 +27,12 @@ Autotuning with microTVM
 This tutorial explains how to autotune a model using the C runtime.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import os
 import json
 import numpy as np
diff --git a/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb b/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb
index 6cc0fd585..fdae47d1c 100644
--- a/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb
+++ b/docs/_downloads/9f81bc348ac4107d0670f512b8943a99/introduction.ipynb
@@ -18,6 +18,17 @@
         "\n# Introduction\n**Authors**:\n[Jocelyn Shiue](https://github.com/),\n[Chris Hoge](https://github.com/hogepodge),\n[Lianmin Zheng](https://github.com/merrymercy)\n\nApache TVM is an open source machine learning compiler framework for CPUs,\nGPUs, and machine learning accelerators. It aims to enable machine learning\nengineers to optimize and run computations efficiently on any hardware backend.\nThe purpose of this tutorial is to take a guided tour through all of the major\nfea [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb b/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb
index bbc4bf70f..97f59a163 100644
--- a/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb
+++ b/docs/_downloads/a1417396e306d987107a7a39376ec261/tuple_inputs.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\n\nimport tvm\nfrom tvm import te\nimport numpy as np"
+        "from __future__ import absolute_import, print_function\n\n\nimport tvm\nfrom tvm import te\nimport numpy as np"
       ]
     },
     {
diff --git a/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py b/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py
index b72040236..712269381 100644
--- a/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py
+++ b/docs/_downloads/a70662bf8dc171d3d17a3945bbbb02e3/from_tflite.py
@@ -52,6 +52,12 @@ Now please check if TFLite package is installed successfully, ``python -c "impor
 
 Below you can find an example on how to compile TFLite model using TVM.
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 ######################################################################
 # Utils for downloading and extracting zip files
 # ----------------------------------------------
diff --git a/docs/_downloads/ab2eef18d10188532645b1d60fc7dd68/micro_ethosu.py b/docs/_downloads/ab2eef18d10188532645b1d60fc7dd68/micro_ethosu.py
index f55fad71d..8e37a0ea5 100644
--- a/docs/_downloads/ab2eef18d10188532645b1d60fc7dd68/micro_ethosu.py
+++ b/docs/_downloads/ab2eef18d10188532645b1d60fc7dd68/micro_ethosu.py
@@ -37,6 +37,12 @@ In this tutorial, we will be compiling a MobileNet v1 model and instructing
 TVM to offload operators to the Ethos(TM)-U55 where possible.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ################################################################################
 # Obtaining TVM
 # -------------
diff --git a/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb b/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
index f33787237..091930b2e 100644
--- a/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
+++ b/docs/_downloads/b11795df0596a55e4982bf895d0c8c38/bring_your_own_datatypes.ipynb
@@ -18,6 +18,17 @@
         "\n# Bring Your Own Datatypes to TVM\n**Authors**: [Gus Smith](https://github.com/gussmith23), [Andrew Liu](https://github.com/hypercubestart)\n\nIn this tutorial, we will show you how to utilize the Bring Your Own Datatypes framework to use your own custom datatypes in TVM.\nNote that the Bring Your Own Datatypes framework currently only handles **software emulated versions of datatypes**.\nThe framework does not support compiling for custom accelerator datatypes out-of-the-box. [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb b/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb
index 5dcd76e61..e9c9579ee 100644
--- a/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb
+++ b/docs/_downloads/b78f1a6e1b2c2fb073a791dc258a1d7d/schedule_primitives.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "from __future__ import absolute_import, print_function\n\nimport tvm\nfrom tvm import te\nimport numpy as np"
+        "from __future__ import absolute_import, print_function\n\n\nimport tvm\nfrom tvm import te\nimport numpy as np"
       ]
     },
     {
diff --git a/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb b/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
index 7b1cdf662..13b05a179 100644
--- a/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
+++ b/docs/_downloads/bc33c0d33026b287306b6ead1a50b04a/tune_relay_arm.ipynb
@@ -18,6 +18,17 @@
         "\n\n# Auto-tuning a Convolutional Network for ARM CPU\n**Author**: [Lianmin Zheng](https://github.com/merrymercy), [Zhao Wu](https://github.com/FrozenGene), [Eddie Yan](https://github.com/eqy)\n\nAuto-tuning for a specific ARM device is critical for getting the best\nperformance. This is a tutorial about how to tune a whole convolutional\nnetwork.\n\nThe operator implementation for ARM CPU in TVM is written in template form.\nThe template has many tunable knobs (tile factor, vec [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb b/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
index fd50af2be..c045c5023 100644
--- a/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
+++ b/docs/_downloads/c20f81a94729f461f33b52cc110fd9d6/deploy_prequantized.ipynb
@@ -18,6 +18,17 @@
         "\n# Deploy a Framework-prequantized Model with TVM\n**Author**: [Masahiro Masuda](https://github.com/masahi)\n\nThis is a tutorial on loading models quantized by deep learning frameworks into TVM.\nPre-quantized model import is one of the quantization support we have in TVM. More details on\nthe quantization story in TVM can be found\n[here](https://discuss.tvm.apache.org/t/quantization-story/3920).\n\nHere, we demonstrate how to load and run models quantized by PyTorch, MXNet,  [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py b/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py
index 1db27799f..895a601ad 100644
--- a/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py
+++ b/docs/_downloads/c23f7654585d9b0fa2129e1765b2a8f2/from_keras.py
@@ -34,6 +34,12 @@ A quick solution is to install via pip
 or please refer to official site
 https://keras.io/#installation
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import tvm.relay as relay
diff --git a/docs/_downloads/c253040abc62eace272e406b7e1a4df5/tedd.py b/docs/_downloads/c253040abc62eace272e406b7e1a4df5/tedd.py
index 34ad43c22..7cb24f433 100644
--- a/docs/_downloads/c253040abc62eace272e406b7e1a4df5/tedd.py
+++ b/docs/_downloads/c253040abc62eace272e406b7e1a4df5/tedd.py
@@ -37,6 +37,12 @@ TEDD renders these three graphs from a given schedule.  This tutorial demonstrat
 how to use TEDD and how to interpret the rendered graphs.
 
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 from tvm import topi
diff --git a/docs/_downloads/caa649473e845a115a0397a2855fd356/low_level_custom_pass.py b/docs/_downloads/caa649473e845a115a0397a2855fd356/low_level_custom_pass.py
index ee96d8220..0f99c72ce 100644
--- a/docs/_downloads/caa649473e845a115a0397a2855fd356/low_level_custom_pass.py
+++ b/docs/_downloads/caa649473e845a115a0397a2855fd356/low_level_custom_pass.py
@@ -40,6 +40,12 @@ Before reading this tutorial, we assume readers have already known these topics
   take a look at ``python/tvm/build_module.py`` to get some basics.
 
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import numpy as np
diff --git a/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py b/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py
index b0132f40b..2e68ce902 100644
--- a/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py
+++ b/docs/_downloads/cb089f2129f9829a01cc54eb81528811/using_relay_viz.py
@@ -35,6 +35,12 @@ We will introduce how to implement customized parsers and renderers through inte
 
 For more details, please refer to :py:mod:`tvm.contrib.relay_viz`.
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 from typing import (
     Dict,
     Union,
diff --git a/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py b/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py
index fd7f5aa9d..8910817c2 100644
--- a/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py
+++ b/docs/_downloads/cc6d9aebd24d54d81752590cbc8f99f9/relay_quick_start.py
@@ -26,6 +26,12 @@ generates a runtime library for Nvidia GPU with TVM.
 Notice that you need to build TVM with cuda and llvm enabled.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################################################################
 # Overview for Supported Hardware Backend of TVM
 # ----------------------------------------------
diff --git a/docs/_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py b/docs/_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py
index ebe18670c..f39244a2e 100644
--- a/docs/_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py
+++ b/docs/_downloads/cccb17d28e5e8b2e94ea8cd5ec59f6ed/deploy_ssd_gluoncv.py
@@ -23,6 +23,12 @@ Deploy Single Shot Multibox Detector(SSD) model
 This article is an introductory tutorial to deploy SSD models with TVM.
 We will use GluonCV pre-trained SSD model and convert it to Relay IR
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 
diff --git a/docs/_downloads/d0a1817b910da958b41d88afe4d4952d/use_pass_instrument.py b/docs/_downloads/d0a1817b910da958b41d88afe4d4952d/use_pass_instrument.py
index 036aa63e3..3079e2f0e 100644
--- a/docs/_downloads/d0a1817b910da958b41d88afe4d4952d/use_pass_instrument.py
+++ b/docs/_downloads/d0a1817b910da958b41d88afe4d4952d/use_pass_instrument.py
@@ -33,6 +33,12 @@ but an extension mechanism is available via the :py:func:`tvm.instrument.pass_in
 This tutorial demonstrates how developers can use ``PassContext`` to instrument
 passes. Please also refer to the :ref:`pass-infra`.
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 import tvm.relay as relay
 from tvm.relay.testing import resnet
diff --git a/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb b/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
index 198fc9d23..c53ddf5e4 100644
--- a/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
+++ b/docs/_downloads/d1434e80dd27eef6b1c9cbaa13f1197b/tune_relay_cuda.ipynb
@@ -18,6 +18,17 @@
         "\n# Auto-tuning a Convolutional Network for NVIDIA GPU\n**Author**: [Lianmin Zheng](https://github.com/merrymercy), [Eddie Yan](https://github.com/eqy/)\n\nAuto-tuning for specific devices and workloads is critical for getting the\nbest performance. This is a tutorial on how to tune a whole convolutional\nnetwork for NVIDIA GPU.\n\nThe operator implementation for NVIDIA GPU in TVM is written in template form.\nThe template has many tunable knobs (tile factor, unrolling, etc).\nW [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/d8509b0a8e7db9031303c1a1f6fd1e70/using_external_lib.py b/docs/_downloads/d8509b0a8e7db9031303c1a1f6fd1e70/using_external_lib.py
index 8b6957d1d..c018ee13c 100644
--- a/docs/_downloads/d8509b0a8e7db9031303c1a1f6fd1e70/using_external_lib.py
+++ b/docs/_downloads/d8509b0a8e7db9031303c1a1f6fd1e70/using_external_lib.py
@@ -31,6 +31,12 @@ For example, to use cuDNN, USE_CUDNN option in `cmake/config.cmake` needs to be
 
 To begin with, we import Relay and TVM.
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import numpy as np
diff --git a/docs/_downloads/d9089082842c138d4c81335f88c60c82/intrin_math.py b/docs/_downloads/d9089082842c138d4c81335f88c60c82/intrin_math.py
index 535563bfb..5a8732abd 100644
--- a/docs/_downloads/d9089082842c138d4c81335f88c60c82/intrin_math.py
+++ b/docs/_downloads/d9089082842c138d4c81335f88c60c82/intrin_math.py
@@ -29,7 +29,13 @@ how we can invoke these target specific functions, and how we can unify
 the interface via TVM's intrinsic API.
 """
 from __future__ import absolute_import, print_function
-import numpy as np
+
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+import numpy as np
 
 import tvm
 from tvm import te
diff --git a/docs/_downloads/da47fa2ad30c4b6921171c97e72f36a9/schedule_primitives.py b/docs/_downloads/da47fa2ad30c4b6921171c97e72f36a9/schedule_primitives.py
index 65fdeda57..af67ed152 100644
--- a/docs/_downloads/da47fa2ad30c4b6921171c97e72f36a9/schedule_primitives.py
+++ b/docs/_downloads/da47fa2ad30c4b6921171c97e72f36a9/schedule_primitives.py
@@ -28,6 +28,12 @@ various primitives provided by TVM.
 """
 from __future__ import absolute_import, print_function
 
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import tvm
 from tvm import te
 import numpy as np
diff --git a/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py b/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py
index fcffbd77f..8953ffc2e 100644
--- a/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py
+++ b/docs/_downloads/dabb6b43ea9ef9d7bd1a3912001deace/build_gcn.py
@@ -118,6 +118,12 @@ infeat_dim: int
 num_classes: int
     dimension of model output (Number of classes)
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 dataset = "cora"
 g, data = load_dataset(dataset)
 
diff --git a/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py b/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py
index a4f7e22d8..5d173e381 100644
--- a/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py
+++ b/docs/_downloads/e3e540f3b477c0c52d8eb73e674e8ffd/tune_conv2d_layer_cuda.py
@@ -37,6 +37,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import os
 
 import numpy as np
diff --git a/docs/_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py b/docs/_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py
index 6cb8d6f14..5a321104c 100644
--- a/docs/_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py
+++ b/docs/_downloads/e416b94ca1090b0897c0f6e0df95b911/tune_network_x86.py
@@ -45,6 +45,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import numpy as np
 
 import tvm
diff --git a/docs/_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py b/docs/_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py
index b9f89f672..279987f00 100644
--- a/docs/_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py
+++ b/docs/_downloads/eac4389b114db015e95cb3cdf8b86b83/auto_scheduler_matmul_x86.py
@@ -38,6 +38,12 @@ We use matrix multiplication as an example in this tutorial.
   __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import os
 
 import numpy as np
diff --git a/docs/_downloads/eafe360d52540634c9eea0fa89e804bd/tune_network_cuda.py b/docs/_downloads/eafe360d52540634c9eea0fa89e804bd/tune_network_cuda.py
index b403c0aa8..cc29f27ba 100644
--- a/docs/_downloads/eafe360d52540634c9eea0fa89e804bd/tune_network_cuda.py
+++ b/docs/_downloads/eafe360d52540634c9eea0fa89e804bd/tune_network_cuda.py
@@ -44,6 +44,12 @@ get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import numpy as np
 
 import tvm
diff --git a/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py b/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py
index 586c811aa..f0256bc7d 100644
--- a/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py
+++ b/docs/_downloads/eb551cfff8900ec35fae9f15aa728e45/from_onnx.py
@@ -32,6 +32,12 @@ A quick solution is to install protobuf compiler, and
 or please refer to official site.
 https://github.com/onnx/onnx
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import onnx
 import numpy as np
 import tvm
diff --git a/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py b/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py
index 1a48781e2..479269a22 100644
--- a/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py
+++ b/docs/_downloads/ee99205e9f2e4f54c0fb7925008a5354/bring_your_own_datatypes.py
@@ -52,6 +52,12 @@ If you would like to try this with your own datatype library, first bring the li
     ctypes.CDLL('my-datatype-lib.so', ctypes.RTLD_GLOBAL)
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 ######################
 # A Simple TVM Program
 # --------------------
diff --git a/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb b/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb
index 0ba721025..a197f86be 100644
--- a/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb
+++ b/docs/_downloads/efe0b02e219b28e0bd85fbdda35ba8ac/tvmc_command_line_driver.ipynb
@@ -18,6 +18,17 @@
         "\n# Compiling and Optimizing a Model with TVMC\n**Authors**:\n[Leandro Nunes](https://github.com/leandron),\n[Matthew Barrett](https://github.com/mbaret),\n[Chris Hoge](https://github.com/hogepodge)\n\nIn this section, we will work with TVMC, the TVM command line driver. TVMC is a\ntool that exposes TVM features such as auto-tuning, compiling, profiling and\nexecution of models through a command line interface.\n\nUpon completion of this section, we will have used TVMC to accomp [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb b/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb
index 8096930fe..c3af7d763 100644
--- a/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb
+++ b/docs/_downloads/f289ca2466fcf79c024068c1f8642bd0/cross_compilation_and_rpc.ipynb
@@ -18,6 +18,17 @@
         "\n\n# Cross Compilation and RPC\n**Author**: [Ziheng Jiang](https://github.com/ZihengJiang/), [Lianmin Zheng](https://github.com/merrymercy/)\n\nThis tutorial introduces cross compilation and remote device\nexecution with RPC in TVM.\n\nWith cross compilation and RPC, you can **compile a program on your\nlocal machine then run it on the remote device**. It is useful when\nthe remote device resource are limited, like Raspberry Pi and mobile\nplatforms. In this tutorial, we will u [...]
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
diff --git a/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py b/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py
index f92f0b0f1..eb27c4b3e 100644
--- a/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py
+++ b/docs/_downloads/f7ae979fbe61064749ce0fb7a621eb4c/from_oneflow.py
@@ -35,6 +35,12 @@ https://github.com/Oneflow-Inc/oneflow
 
 Currently, TVM supports OneFlow 0.7.0. Other versions may be unstable.
 """
+
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
 import os, math
 from matplotlib import pyplot as plt
 import numpy as np
diff --git a/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py b/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py
index e8d0b4998..98b531fa6 100644
--- a/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py
+++ b/docs/_downloads/f90d5f6bfd99e0d9812ae5b91503e148/from_pytorch.py
@@ -41,6 +41,12 @@ Currently, TVM supports PyTorch 1.7 and 1.4. Other versions may
 be unstable.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 import tvm
 from tvm import relay
 
diff --git a/docs/_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py b/docs/_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py
index caee2b3b4..fdb4de289 100644
--- a/docs/_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py
+++ b/docs/_downloads/fb8217c13f4351224c6cf3aacf1a87fc/deploy_prequantized.py
@@ -28,6 +28,12 @@ Here, we demonstrate how to load and run models quantized by PyTorch, MXNet, and
 Once loaded, we can run compiled, quantized models on any hardware TVM supports.
 """
 
+# sphinx_gallery_start_ignore
+from tvm import testing
+
+testing.utils.install_request_hook(depth=3)
+# sphinx_gallery_end_ignore
+
 #################################################################################
 # First, necessary imports
 from PIL import Image
diff --git a/docs/_sources/how_to/compile_models/from_coreml.rst.txt b/docs/_sources/how_to/compile_models/from_coreml.rst.txt
index 220de9daf..db6405001 100644
--- a/docs/_sources/how_to/compile_models/from_coreml.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_coreml.rst.txt
@@ -35,10 +35,11 @@ A quick solution is to install via pip
 or please refer to official site
 https://github.com/apple/coremltools
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-45
+.. GENERATED FROM PYTHON SOURCE LINES 37-46
 
 .. code-block:: default
 
+
     import tvm
     from tvm import te
     import tvm.relay as relay
@@ -54,14 +55,14 @@ https://github.com/apple/coremltools
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 46-50
+.. GENERATED FROM PYTHON SOURCE LINES 52-56
 
 Load pretrained CoreML model
 ----------------------------
 We will download and load a pretrained mobilenet classification network
 provided by apple in this example
 
-.. GENERATED FROM PYTHON SOURCE LINES 50-56
+.. GENERATED FROM PYTHON SOURCE LINES 56-62
 
 .. code-block:: default
 
@@ -78,13 +79,13 @@ provided by apple in this example
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 57-60
+.. GENERATED FROM PYTHON SOURCE LINES 63-66
 
 Load a test image
 ------------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 60-67
+.. GENERATED FROM PYTHON SOURCE LINES 66-73
 
 .. code-block:: default
 
@@ -102,13 +103,13 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 68-71
+.. GENERATED FROM PYTHON SOURCE LINES 74-77
 
 Compile the model on Relay
 ---------------------------
 We should be familiar with the process by now.
 
-.. GENERATED FROM PYTHON SOURCE LINES 71-80
+.. GENERATED FROM PYTHON SOURCE LINES 77-86
 
 .. code-block:: default
 
@@ -135,13 +136,13 @@ We should be familiar with the process right now.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 81-84
+.. GENERATED FROM PYTHON SOURCE LINES 87-90
 
 Execute on TVM
 -------------------
 The process is no different from other examples.
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-97
+.. GENERATED FROM PYTHON SOURCE LINES 90-103
 
 .. code-block:: default
 
@@ -165,13 +166,13 @@ The process is no different from other example
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 98-101
+.. GENERATED FROM PYTHON SOURCE LINES 104-107
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-115
+.. GENERATED FROM PYTHON SOURCE LINES 107-121
 
 .. code-block:: default
 
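
The renumbered sections above describe code blocks this diff elides. For orientation, a minimal sketch of the CoreML import step, assuming the tutorial's MobileNet model and an input named "image" (both assumptions; neither appears in this diff):

.. code-block:: python

    import coremltools as cm
    import tvm
    from tvm import relay

    # Hypothetical local file; the tutorial downloads the model first.
    mlmodel = cm.models.MLModel("mobilenet.mlmodel")
    shape_dict = {"image": (1, 3, 224, 224)}  # CoreML input name -> NCHW shape
    mod, params = relay.frontend.from_coreml(mlmodel, shape_dict)

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)
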
diff --git a/docs/_sources/how_to/compile_models/from_darknet.rst.txt b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
index 0a427baa7..c05aaa119 100644
--- a/docs/_sources/how_to/compile_models/from_darknet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_darknet.rst.txt
@@ -33,11 +33,12 @@ Please install CFFI and CV2 before executing this script
   pip install cffi
   pip install opencv-python
 
-.. GENERATED FROM PYTHON SOURCE LINES 33-49
+.. GENERATED FROM PYTHON SOURCE LINES 33-50
 
 .. code-block:: default
 
 
+
     # numpy and matplotlib
     import numpy as np
     import matplotlib.pyplot as plt
@@ -60,13 +61,13 @@ Please install CFFI and CV2 before executing this script
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 50-53
+.. GENERATED FROM PYTHON SOURCE LINES 56-59
 
 Choose the model
 -----------------------
 Models are: 'yolov2', 'yolov3' or 'yolov3-tiny'
 
-.. GENERATED FROM PYTHON SOURCE LINES 53-57
+.. GENERATED FROM PYTHON SOURCE LINES 59-63
 
 .. code-block:: default
 
@@ -81,13 +82,13 @@ Models are: 'yolov2', 'yolov3' or 'yolov3-tiny'
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 58-61
+.. GENERATED FROM PYTHON SOURCE LINES 64-67
 
 Download required files
 -----------------------
 Download the cfg and weights files if running for the first time.
 
-.. GENERATED FROM PYTHON SOURCE LINES 61-93
+.. GENERATED FROM PYTHON SOURCE LINES 67-99
 
 .. code-block:: default
 
@@ -136,13 +137,13 @@ Download cfg and weights file if first time.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 94-97
+.. GENERATED FROM PYTHON SOURCE LINES 100-103
 
 Import the graph to Relay
 -------------------------
 compile the model
 
-.. GENERATED FROM PYTHON SOURCE LINES 97-106
+.. GENERATED FROM PYTHON SOURCE LINES 103-112
 
 .. code-block:: default
 
@@ -170,12 +171,12 @@ compile the model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-109
+.. GENERATED FROM PYTHON SOURCE LINES 113-115
 
 Load a test image
 -----------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 109-115
+.. GENERATED FROM PYTHON SOURCE LINES 115-121
 
 .. code-block:: default
 
@@ -198,13 +199,13 @@ Load a test image
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 116-119
+.. GENERATED FROM PYTHON SOURCE LINES 122-125
 
 Execute on TVM Runtime
 ----------------------
 The process is no different from other examples.
 
-.. GENERATED FROM PYTHON SOURCE LINES 119-203
+.. GENERATED FROM PYTHON SOURCE LINES 125-209
 
 .. code-block:: default
 
@@ -314,11 +315,6 @@ The process is no different from other examples.
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  19.763 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_darknet.py:
 
 .. only:: html
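
Behind "Import the graph to Relay" above sits a single conversion call. A hedged sketch, with the library/cfg/weights paths standing in for the files the tutorial downloads:

.. code-block:: python

    import numpy as np
    from tvm import relay
    from tvm.relay.testing.darknet import __darknetffi__

    # Hypothetical paths; the tutorial fetches these files beforehand.
    lib = __darknetffi__.dlopen("./libdarknet2.0.so")
    net = lib.load_network(b"./yolov3.cfg", b"./yolov3.weights", 0)

    # Darknet reports its own input geometry (channels, height, width).
    data = np.empty([1, net.c, net.h, net.w], "float32")
    mod, params = relay.frontend.from_darknet(net, dtype="float32", shape=data.shape)
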
diff --git a/docs/_sources/how_to/compile_models/from_keras.rst.txt b/docs/_sources/how_to/compile_models/from_keras.rst.txt
index e589fad33..f86295a86 100644
--- a/docs/_sources/how_to/compile_models/from_keras.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_keras.rst.txt
@@ -37,10 +37,11 @@ A quick solution is to install via pip
 or please refer to official site
 https://keras.io/#installation
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-45
+.. GENERATED FROM PYTHON SOURCE LINES 37-46
 
 .. code-block:: default
 
+
     import tvm
     from tvm import te
     import tvm.relay as relay
@@ -56,13 +57,13 @@ https://keras.io/#installation
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 46-49
+.. GENERATED FROM PYTHON SOURCE LINES 52-55
 
 Load pretrained keras model
 ----------------------------
 We load a pretrained resnet-50 classification model provided by keras.
 
-.. GENERATED FROM PYTHON SOURCE LINES 49-74
+.. GENERATED FROM PYTHON SOURCE LINES 55-80
 
 .. code-block:: default
 
@@ -98,13 +99,13 @@ We load a pretrained resnet-50 classification model provided by keras.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 75-78
+.. GENERATED FROM PYTHON SOURCE LINES 81-84
 
 Load a test image
 ------------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 78-92
+.. GENERATED FROM PYTHON SOURCE LINES 84-98
 
 .. code-block:: default
 
@@ -140,13 +141,13 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 93-96
+.. GENERATED FROM PYTHON SOURCE LINES 99-102
 
 Compile the model with Relay
 ----------------------------
 convert the Keras model (NHWC layout) to Relay format (NCHW layout).
 
-.. GENERATED FROM PYTHON SOURCE LINES 96-110
+.. GENERATED FROM PYTHON SOURCE LINES 102-116
 
 .. code-block:: default
 
@@ -171,12 +172,12 @@ convert the keras model(NHWC layout) to Relay format(NCHW layout).
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-113
+.. GENERATED FROM PYTHON SOURCE LINES 117-119
 
 Execute on TVM
 ---------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-117
+.. GENERATED FROM PYTHON SOURCE LINES 119-123
 
 .. code-block:: default
 
@@ -191,13 +192,13 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 118-121
+.. GENERATED FROM PYTHON SOURCE LINES 124-127
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 121-138
+.. GENERATED FROM PYTHON SOURCE LINES 127-144
 
 .. code-block:: default
 
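
The "Compile the model with Relay" step above is the NHWC-to-NCHW conversion it mentions. A minimal sketch, assuming Keras's default input name "input_1" (the elided tutorial code defines the real one):

.. code-block:: python

    import keras
    import numpy as np
    from tvm import relay

    model = keras.applications.resnet50.ResNet50(weights="imagenet")
    # from_keras re-lays the NHWC Keras graph out as NCHW, so the declared
    # input shape here is already NCHW.
    data = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
    shape_dict = {"input_1": data.shape}  # assumed default input name
    mod, params = relay.frontend.from_keras(model, shape_dict)
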
diff --git a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
index b24031c45..d9d715dc6 100644
--- a/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_mxnet.rst.txt
@@ -37,10 +37,11 @@ A quick solution is
 or please refer to official installation guide.
 https://mxnet.apache.org/versions/master/install/index.html
 
-.. GENERATED FROM PYTHON SOURCE LINES 38-44
+.. GENERATED FROM PYTHON SOURCE LINES 38-45
 
 .. code-block:: default
 
+
     # some standard imports
     import mxnet as mx
     import tvm
@@ -54,13 +55,13 @@ https://mxnet.apache.org/versions/master/install/index.html
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 45-48
+.. GENERATED FROM PYTHON SOURCE LINES 51-54
 
 Download Resnet18 model from Gluon Model Zoo
 ---------------------------------------------
 In this section, we download a pretrained imagenet model and classify an image.
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-85
+.. GENERATED FROM PYTHON SOURCE LINES 54-91
 
 .. code-block:: default
 
@@ -114,13 +115,13 @@ In this section, we download a pretrained imagenet model and classify an image.
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zip6ce158d8-d0bd-4cc7-b8f0-7345e113f345 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
+    Downloading /workspace/.mxnet/models/resnet18_v1-a0666292.zipc244bd1b-c077-448e-80fe-9d7f1978bba4 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
     x (1, 3, 224, 224)
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 86-91
+.. GENERATED FROM PYTHON SOURCE LINES 92-97
 
 Compile the Graph
 -----------------
@@ -128,7 +129,7 @@ Now we would like to port the Gluon model to a portable computational graph.
 It's as easy as several lines.
 We support MXNet static graph (symbol) and HybridBlock in mxnet.gluon.
 
-.. GENERATED FROM PYTHON SOURCE LINES 91-97
+.. GENERATED FROM PYTHON SOURCE LINES 97-103
 
 .. code-block:: default
 
@@ -145,11 +146,11 @@ We support MXNet static graph(symbol) and HybridBlock in mxnet.gluon
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 98-99
+.. GENERATED FROM PYTHON SOURCE LINES 104-105
 
 now compile the graph
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-103
+.. GENERATED FROM PYTHON SOURCE LINES 105-109
 
 .. code-block:: default
 
@@ -171,13 +172,13 @@ now compile the graph
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 104-107
+.. GENERATED FROM PYTHON SOURCE LINES 110-113
 
 Execute the portable graph on TVM
 ---------------------------------
 Now, we would like to reproduce the same forward computation using TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-121
+.. GENERATED FROM PYTHON SOURCE LINES 113-127
 
 .. code-block:: default
 
@@ -208,14 +209,14 @@ Now, we would like to reproduce the same forward computation using TVM.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-126
+.. GENERATED FROM PYTHON SOURCE LINES 128-132
 
 Use MXNet symbol with pretrained weights
 ----------------------------------------
 MXNet often uses `arg_params` and `aux_params` to store network parameters
 separately; here we show how to use these weights with the existing API
 
-.. GENERATED FROM PYTHON SOURCE LINES 126-141
+.. GENERATED FROM PYTHON SOURCE LINES 132-147
 
 .. code-block:: default
 
@@ -241,11 +242,11 @@ separately, here we show how to use these weights with existing API
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 142-143
+.. GENERATED FROM PYTHON SOURCE LINES 148-149
 
 for a normal mxnet model, we start from here
 
-.. GENERATED FROM PYTHON SOURCE LINES 143-147
+.. GENERATED FROM PYTHON SOURCE LINES 149-153
 
 .. code-block:: default
 
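
"Compile the Graph" above says porting the Gluon model is as easy as several lines; roughly, those lines are (model name and shape taken from the tutorial text):

.. code-block:: python

    from mxnet.gluon.model_zoo.vision import get_model
    from tvm import relay

    block = get_model("resnet18_v1", pretrained=True)
    shape_dict = {"data": (1, 3, 224, 224)}
    # from_mxnet accepts a Gluon HybridBlock directly; a static symbol plus
    # arg_params/aux_params works too, as the section above notes.
    mod, params = relay.frontend.from_mxnet(block, shape_dict)
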
diff --git a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
index 2b6da480f..e6fcd040b 100644
--- a/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_oneflow.rst.txt
@@ -38,10 +38,11 @@ https://github.com/Oneflow-Inc/oneflow
 
 Currently, TVM supports OneFlow 0.7.0. Other versions may be unstable.
 
-.. GENERATED FROM PYTHON SOURCE LINES 38-52
+.. GENERATED FROM PYTHON SOURCE LINES 38-53
 
 .. code-block:: default
 
+
     import os, math
     from matplotlib import pyplot as plt
     import numpy as np
@@ -86,12 +87,12 @@ Currently, TVM supports OneFlow 0.7.0. Other versions may be unstable.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 53-55
+.. GENERATED FROM PYTHON SOURCE LINES 59-61
 
 Load a pretrained OneFlow model and save model
 ----------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 55-63
+.. GENERATED FROM PYTHON SOURCE LINES 61-69
 
 .. code-block:: default
 
@@ -112,18 +113,18 @@ Load a pretrained OneFlow model and save model
  .. code-block:: none
 
     Downloading: "https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/flowvision/classification/ResNet/resnet18.zip" to /workspace/.oneflow/flowvision_cache/resnet18.zip
-     0%|          | 0.00/41.5M [00:00<?, ?B/s] [... carriage-return progress updates elided ...] 100%|##########| 41.5M/41.5M [00:09<00:00, 4.76MB/s]
+     0%|          | 0.00/41.5M [00:00<?, ?B/s] [... carriage-return progress updates elided ...] 100%|##########| 41.5M/41.5M [00:01<00:00, 41.9MB/s]
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 64-67
+.. GENERATED FROM PYTHON SOURCE LINES 70-73
 
 Load a test image
 -----------------
 Classic cat example!
 
-.. GENERATED FROM PYTHON SOURCE LINES 67-87
+.. GENERATED FROM PYTHON SOURCE LINES 73-93
 
 .. code-block:: default
 
@@ -154,13 +155,13 @@ Classic cat example!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 88-91
+.. GENERATED FROM PYTHON SOURCE LINES 94-97
 
 Import the graph to Relay
 -------------------------
 Convert OneFlow graph to Relay graph. The input name can be arbitrary.
 
-.. GENERATED FROM PYTHON SOURCE LINES 91-106
+.. GENERATED FROM PYTHON SOURCE LINES 97-112
 
 .. code-block:: default
 
@@ -186,13 +187,13 @@ Convert OneFlow graph to Relay graph. The input name can be arbitrary.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-110
+.. GENERATED FROM PYTHON SOURCE LINES 113-116
 
 Relay Build
 -----------
 Compile the graph to llvm target with given input specification.
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-115
+.. GENERATED FROM PYTHON SOURCE LINES 116-121
 
 .. code-block:: default
 
@@ -215,13 +216,13 @@ Compile the graph to llvm target with given input specification.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 116-119
+.. GENERATED FROM PYTHON SOURCE LINES 122-125
 
 Execute the portable graph on TVM
 ---------------------------------
 Now we can try deploying the compiled model on target.
 
-.. GENERATED FROM PYTHON SOURCE LINES 119-127
+.. GENERATED FROM PYTHON SOURCE LINES 125-133
 
 .. code-block:: default
 
@@ -247,13 +248,13 @@ Now we can try deploying the compiled model on target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 128-131
+.. GENERATED FROM PYTHON SOURCE LINES 134-137
 
 Look up synset name
 -------------------
 Look up prediction top 1 index in 1000 class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 131-178
+.. GENERATED FROM PYTHON SOURCE LINES 137-184
 
 .. code-block:: default
 
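
The OneFlow sections above elide their code. A rough sketch under stated assumptions: that the model is wrapped in a compiled flow.nn.Graph and the frontend is invoked as from_oneflow(graph, model_dir), as in TVM releases of this period; the checkpoint directory name is hypothetical:

.. code-block:: python

    import flowvision
    import oneflow as flow
    from tvm import relay

    model = flowvision.models.resnet18(pretrained=True)
    model.eval()

    class Graph(flow.nn.Graph):
        def __init__(self, module):
            super().__init__()
            self.m = module

        def build(self, x):
            return self.m(x)

    graph = Graph(model)
    _ = graph._compile(flow.randn(1, 3, 224, 224))

    # "resnet18_model" is a hypothetical directory saved earlier with
    # flow.save(model.state_dict(), "resnet18_model").
    mod, params = relay.frontend.from_oneflow(graph, "resnet18_model")
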
diff --git a/docs/_sources/how_to/compile_models/from_onnx.rst.txt b/docs/_sources/how_to/compile_models/from_onnx.rst.txt
index 5e6ffddbb..20b7c0807 100644
--- a/docs/_sources/how_to/compile_models/from_onnx.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_onnx.rst.txt
@@ -35,10 +35,11 @@ A quick solution is to install protobuf compiler, and
 or please refer to official site.
 https://github.com/onnx/onnx
 
-.. GENERATED FROM PYTHON SOURCE LINES 35-42
+.. GENERATED FROM PYTHON SOURCE LINES 35-43
 
 .. code-block:: default
 
+
     import onnx
     import numpy as np
     import tvm
@@ -53,7 +54,7 @@ https://github.com/onnx/onnx
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 43-48
+.. GENERATED FROM PYTHON SOURCE LINES 49-54
 
 Load pretrained ONNX model
 ---------------------------------------------
@@ -61,7 +62,7 @@ The example super resolution model used here is exactly the same model in onnx t
 http://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
 we skip the pytorch model construction part, and download the saved onnx model
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-60
+.. GENERATED FROM PYTHON SOURCE LINES 54-66
 
 .. code-block:: default
 
@@ -84,7 +85,7 @@ we skip the pytorch model construction part, and download the saved onnx model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 61-68
+.. GENERATED FROM PYTHON SOURCE LINES 67-74
 
 Load a test image
 ---------------------------------------------
@@ -94,7 +95,7 @@ axis, a 672x672 image. Re-scale the cat image to fit this input shape then
 convert to `YCbCr`. The super resolution model will then be applied to the
 luminance (`Y`) channel.
 
-.. GENERATED FROM PYTHON SOURCE LINES 68-77
+.. GENERATED FROM PYTHON SOURCE LINES 74-83
 
 .. code-block:: default
 
@@ -114,7 +115,7 @@ luminance (`Y`) channel.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 78-88
+.. GENERATED FROM PYTHON SOURCE LINES 84-94
 
 Compile the model with relay
 ---------------------------------------------
@@ -127,7 +128,7 @@ Passing in the shape dictionary to the `relay.frontend.from_onnx` method
 tells Relay which ONNX names are inputs and which are bound parameters, and
 provides a static definition of the input size.
 
-.. GENERATED FROM PYTHON SOURCE LINES 88-99
+.. GENERATED FROM PYTHON SOURCE LINES 94-105
 
 .. code-block:: default
 
@@ -160,12 +161,12 @@ provides a static definition of the input size.
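
A minimal sketch of the conversion call described above, assuming the ONNX
graph's input is named ``"1"`` and ``x`` is the preprocessed image (both are
assumptions here):

.. code-block:: python

    from tvm import relay

    shape_dict = {"1": x.shape}  # input name -> static shape
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)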
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 100-102
+.. GENERATED FROM PYTHON SOURCE LINES 106-108
 
 Execute on TVM
 ---------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 102-105
+.. GENERATED FROM PYTHON SOURCE LINES 108-111
 
 .. code-block:: default
 
@@ -179,7 +180,7 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 106-111
+.. GENERATED FROM PYTHON SOURCE LINES 112-117
 
 Display results
 ---------------------------------------------
@@ -187,7 +188,7 @@ We put input and output image neck to neck. The luminance channel, `Y` is the ou
 from the model. The chroma channels `Cb` and `Cr` are resized to match using a simple
 bicubic algorithm. The image is then recombined and converted back to `RGB`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-123
+.. GENERATED FROM PYTHON SOURCE LINES 117-129
 
 .. code-block:: default
 
@@ -216,15 +217,15 @@ bicubic algorithm. The image is then recombined and converted back to `RGB`.
 
  .. code-block:: none
 
-    /workspace/gallery/how_to/compile_models/from_onnx.py:114: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
+    /workspace/gallery/how_to/compile_models/from_onnx.py:120: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
       out_cb = img_cb.resize(out_y.size, Image.BICUBIC)
-    /workspace/gallery/how_to/compile_models/from_onnx.py:115: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
+    /workspace/gallery/how_to/compile_models/from_onnx.py:121: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
       out_cr = img_cr.resize(out_y.size, Image.BICUBIC)
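
The deprecation warnings above come from Pillow: since release 9.1 the
resampling filters live under ``Image.Resampling``. A hedged sketch of a
forward-compatible spelling, using the ``img_cb`` / ``img_cr`` names from the
tutorial output:

.. code-block:: python

    from PIL import Image

    # Fall back to the module-level constant on older Pillow releases.
    BICUBIC = getattr(Image, "Resampling", Image).BICUBIC
    out_cb = img_cb.resize(out_y.size, BICUBIC)
    out_cr = img_cr.resize(out_y.size, BICUBIC)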
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-136
+.. GENERATED FROM PYTHON SOURCE LINES 130-142
 
 Notes
 ---------------------------------------------
diff --git a/docs/_sources/how_to/compile_models/from_paddle.rst.txt b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
index 250857124..384f8cbea 100644
--- a/docs/_sources/how_to/compile_models/from_paddle.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_paddle.rst.txt
@@ -33,10 +33,11 @@ A quick solution is
 or please refer to the official site.
 https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html
 
-.. GENERATED FROM PYTHON SOURCE LINES 33-40
+.. GENERATED FROM PYTHON SOURCE LINES 33-41
 
 .. code-block:: default
 
+
     import tarfile
     import paddle
     import numpy as np
@@ -68,13 +69,13 @@ https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/inst
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 41-44
+.. GENERATED FROM PYTHON SOURCE LINES 47-50
 
 Load pretrained ResNet50 model
 ---------------------------------------------
 We load a pretrained ResNet50 provided by PaddlePaddle.
 
-.. GENERATED FROM PYTHON SOURCE LINES 44-54
+.. GENERATED FROM PYTHON SOURCE LINES 50-60
 
 .. code-block:: default
 
@@ -102,13 +103,13 @@ We load a pretrained ResNet50 provided by PaddlePaddle.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 55-58
+.. GENERATED FROM PYTHON SOURCE LINES 61-64
 
 Load a test image
 ---------------------------------------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 58-79
+.. GENERATED FROM PYTHON SOURCE LINES 64-85
 
 .. code-block:: default
 
@@ -140,12 +141,12 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 80-82
+.. GENERATED FROM PYTHON SOURCE LINES 86-88
 
 Compile the model with relay
 ---------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 82-92
+.. GENERATED FROM PYTHON SOURCE LINES 88-98
 
 .. code-block:: default
 
@@ -173,12 +174,12 @@ Compile the model with relay
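
A minimal sketch of this step, assuming ``model`` is the ``paddle.jit``-loaded
inference model from the section above and that the input name and shape
follow the ResNet50 preprocessing (both assumptions):

.. code-block:: python

    from tvm import relay

    shape_dict = {"inputs": [1, 3, 224, 224]}
    mod, params = relay.frontend.from_paddle(model, shape_dict=shape_dict)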
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 93-95
+.. GENERATED FROM PYTHON SOURCE LINES 99-101
 
 Execute on TVM
 ---------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 95-98
+.. GENERATED FROM PYTHON SOURCE LINES 101-104
 
 .. code-block:: default
 
@@ -192,13 +193,13 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-102
+.. GENERATED FROM PYTHON SOURCE LINES 105-108
 
 Look up synset name
 ---------------------------------------------
 Look up the prediction's top-1 index in the 1000-class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 102-118
+.. GENERATED FROM PYTHON SOURCE LINES 108-124
 
 .. code-block:: default
 
@@ -233,11 +234,6 @@ Look up prediction top 1 index in 1000 class synset.
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  7.728 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_paddle.py:
 
 .. only:: html
diff --git a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
index 4c96efd82..1f1678cb3 100644
--- a/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_pytorch.rst.txt
@@ -43,11 +43,12 @@ with the proper TorchVision version.
 Currently, TVM supports PyTorch 1.7 and 1.4. Other versions may
 be unstable.
 
-.. GENERATED FROM PYTHON SOURCE LINES 43-55
+.. GENERATED FROM PYTHON SOURCE LINES 43-56
 
 .. code-block:: default
 
 
+
     import tvm
     from tvm import relay
 
@@ -66,12 +67,12 @@ be unstable.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 56-58
+.. GENERATED FROM PYTHON SOURCE LINES 62-64
 
 Load a pretrained PyTorch model
 -------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 58-67
+.. GENERATED FROM PYTHON SOURCE LINES 64-73
 
 .. code-block:: default
 
@@ -93,18 +94,18 @@ Load a pretrained PyTorch model
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /workspace/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
-
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     30%|###       | 13.6M/44.7M [00:00<00:00, 141MB/s]
     64%|######3   | 28.5M/44.7M [00:00<00:00, 149MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 159MB/s]
+
      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     21%|##1       | 9.50M/44.7M [00:00<00:00, 99.6MB/s]
     53%|#####3    | 23.8M/44.7M [00:00<00:00, 129MB/s] 
     83%|########3 | 37.1M/44.7M [00:00<00:00, 134MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 123MB/s]
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 68-71
+.. GENERATED FROM PYTHON SOURCE LINES 74-77
 
 Load a test image
 -----------------
 Classic cat example!
 
-.. GENERATED FROM PYTHON SOURCE LINES 71-91
+.. GENERATED FROM PYTHON SOURCE LINES 77-97
 
 .. code-block:: default
 
@@ -135,13 +136,13 @@ Classic cat example!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 92-95
+.. GENERATED FROM PYTHON SOURCE LINES 98-101
 
 Import the graph to Relay
 -------------------------
 Convert the PyTorch graph to a Relay graph. The input name can be arbitrary.
 
-.. GENERATED FROM PYTHON SOURCE LINES 95-99
+.. GENERATED FROM PYTHON SOURCE LINES 101-105
 
 .. code-block:: default
 
@@ -156,13 +157,13 @@ Convert PyTorch graph to Relay graph. The input name can be arbitrary.
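
A minimal sketch of the conversion, assuming ``scripted_model`` is the traced
TorchScript module and ``img`` the preprocessed input from the steps above
(illustrative names):

.. code-block:: python

    from tvm import relay

    input_name = "input0"  # arbitrary, as noted above
    shape_list = [(input_name, img.shape)]
    mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)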
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 100-103
+.. GENERATED FROM PYTHON SOURCE LINES 106-109
 
 Relay Build
 -----------
 Compile the graph to the llvm target with the given input specification.
 
-.. GENERATED FROM PYTHON SOURCE LINES 103-108
+.. GENERATED FROM PYTHON SOURCE LINES 109-114
 
 .. code-block:: default
 
@@ -185,13 +186,13 @@ Compile the graph to llvm target with given input specification.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 109-112
+.. GENERATED FROM PYTHON SOURCE LINES 115-118
 
 Execute the portable graph on TVM
 ---------------------------------
 Now we can try deploying the compiled model on the target.
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-123
+.. GENERATED FROM PYTHON SOURCE LINES 118-129
 
 .. code-block:: default
 
@@ -213,13 +214,13 @@ Now we can try deploying the compiled model on target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-127
+.. GENERATED FROM PYTHON SOURCE LINES 130-133
 
 Look up synset name
 -------------------
 Look up the prediction's top-1 index in the 1000-class synset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 127-172
+.. GENERATED FROM PYTHON SOURCE LINES 133-178
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
index 36cfeb00c..009e690a4 100644
--- a/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tensorflow.rst.txt
@@ -26,11 +26,12 @@ For us to begin with, tensorflow python module is required to be installed.
 
 Please refer to https://www.tensorflow.org/install
 
-.. GENERATED FROM PYTHON SOURCE LINES 26-69
+.. GENERATED FROM PYTHON SOURCE LINES 26-70
 
 .. code-block:: default
 
 
+
     # tvm, relay
     import tvm
     from tvm import te
@@ -80,14 +81,14 @@ Please refer to https://www.tensorflow.org/install
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 70-74
+.. GENERATED FROM PYTHON SOURCE LINES 76-80
 
 Tutorials
 ---------
 Please refer to docs/frontend/tensorflow.md for more details on various models
 from tensorflow.
 
-.. GENERATED FROM PYTHON SOURCE LINES 74-95
+.. GENERATED FROM PYTHON SOURCE LINES 80-101
 
 .. code-block:: default
 
@@ -119,13 +120,13 @@ from tensorflow.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 96-99
+.. GENERATED FROM PYTHON SOURCE LINES 102-105
 
 Download required files
 -----------------------
 Download files listed above.
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-106
+.. GENERATED FROM PYTHON SOURCE LINES 105-112
 
 .. code-block:: default
 
@@ -143,13 +144,13 @@ Download files listed above.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-110
+.. GENERATED FROM PYTHON SOURCE LINES 113-116
 
 Import model
 ------------
 Create a tensorflow graph definition from the protobuf file.
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-121
+.. GENERATED FROM PYTHON SOURCE LINES 116-127
 
 .. code-block:: default
 
@@ -171,7 +172,7 @@ Creates tensorflow graph definition from protobuf file.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-130
+.. GENERATED FROM PYTHON SOURCE LINES 128-136
 
 Decode image
 ------------
@@ -182,7 +183,7 @@ Decode image
   Hence we supply a decoded frame to TVM instead.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 130-137
+.. GENERATED FROM PYTHON SOURCE LINES 136-143
 
 .. code-block:: default
 
@@ -200,7 +201,7 @@ Decode image
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 138-145
+.. GENERATED FROM PYTHON SOURCE LINES 144-151
 
 Import the graph to Relay
 -------------------------
@@ -210,7 +211,7 @@ Results:
   sym: relay expr for given tensorflow protobuf.
   params: params converted from tensorflow params (tensor protobuf).
 
-.. GENERATED FROM PYTHON SOURCE LINES 145-150
+.. GENERATED FROM PYTHON SOURCE LINES 151-156
 
 .. code-block:: default
 
@@ -236,7 +237,7 @@ Results:
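
A minimal sketch of the import call, assuming ``graph_def`` is the parsed
protobuf from the previous section and that the input tensor name matches the
InceptionV1 graph (an assumption):

.. code-block:: python

    from tvm import relay

    shape_dict = {"DecodeJpeg/contents": x.shape}
    mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)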
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 151-159
+.. GENERATED FROM PYTHON SOURCE LINES 157-165
 
 Relay Build
 -----------
@@ -247,7 +248,7 @@ Results:
   params: final params after compilation.
   lib: target library which can be deployed on target with TVM runtime.
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-163
+.. GENERATED FROM PYTHON SOURCE LINES 165-169
 
 .. code-block:: default
 
@@ -269,13 +270,13 @@ Results:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 164-167
+.. GENERATED FROM PYTHON SOURCE LINES 170-173
 
 Execute the portable graph on TVM
 ---------------------------------
 Now we can try deploying the compiled model on the target.
 
-.. GENERATED FROM PYTHON SOURCE LINES 167-179
+.. GENERATED FROM PYTHON SOURCE LINES 173-185
 
 .. code-block:: default
 
@@ -298,13 +299,13 @@ Now we can try deploying the compiled model on target.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 180-183
+.. GENERATED FROM PYTHON SOURCE LINES 186-189
 
 Process the output
 ------------------
 Process the model output into human-readable text for InceptionV1.
 
-.. GENERATED FROM PYTHON SOURCE LINES 183-196
+.. GENERATED FROM PYTHON SOURCE LINES 189-202
 
 .. code-block:: default
 
@@ -338,13 +339,13 @@ Process the model output to human readable text for InceptionV1.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 197-200
+.. GENERATED FROM PYTHON SOURCE LINES 203-206
 
 Inference on tensorflow
 -----------------------
 Run the corresponding model on tensorflow.
 
-.. GENERATED FROM PYTHON SOURCE LINES 200-253
+.. GENERATED FROM PYTHON SOURCE LINES 206-259
 
 .. code-block:: default
 
@@ -420,11 +421,6 @@ Run the corresponding model on tensorflow
 
 
 
-.. rst-class:: sphx-glr-timing
-
-   **Total running time of the script:** ( 1 minutes  3.691 seconds)
-
-
 .. _sphx_glr_download_how_to_compile_models_from_tensorflow.py:
 
 .. only:: html
diff --git a/docs/_sources/how_to/compile_models/from_tflite.rst.txt b/docs/_sources/how_to/compile_models/from_tflite.rst.txt
index 7233c1336..29fc4f171 100644
--- a/docs/_sources/how_to/compile_models/from_tflite.rst.txt
+++ b/docs/_sources/how_to/compile_models/from_tflite.rst.txt
@@ -55,12 +55,24 @@ Now please check if TFLite package is installed successfully, ``python -c "impor
 
 Below you can find an example on how to compile TFLite model using TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 56-58
+.. GENERATED FROM PYTHON SOURCE LINES 55-56
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 62-64
 
 Utils for downloading and extracting zip files
 ----------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 58-73
+.. GENERATED FROM PYTHON SOURCE LINES 64-79
 
 .. code-block:: default
 
@@ -86,13 +98,13 @@ Utils for downloading and extracting zip files
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 74-77
+.. GENERATED FROM PYTHON SOURCE LINES 80-83
 
 Load pretrained TFLite model
 ----------------------------
 Load the mobilenet V1 TFLite model provided by Google.
 
-.. GENERATED FROM PYTHON SOURCE LINES 77-100
+.. GENERATED FROM PYTHON SOURCE LINES 83-106
 
 .. code-block:: default
 
@@ -126,13 +138,13 @@ Load mobilenet V1 TFLite model provided by Google
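
A minimal sketch of loading the flatbuffer, assuming the model file was
extracted next to the script (the path is an assumption):

.. code-block:: python

    import tflite

    with open("mobilenet_v1_1.0_224.tflite", "rb") as f:
        tflite_model_buf = f.read()
    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)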
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-104
+.. GENERATED FROM PYTHON SOURCE LINES 107-110
 
 Load a test image
 -----------------
 A single cat dominates the examples!
 
-.. GENERATED FROM PYTHON SOURCE LINES 104-125
+.. GENERATED FROM PYTHON SOURCE LINES 110-131
 
 .. code-block:: default
 
@@ -175,12 +187,12 @@ A single cat dominates the examples!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 126-128
+.. GENERATED FROM PYTHON SOURCE LINES 132-134
 
 Compile the model with relay
 ----------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 128-146
+.. GENERATED FROM PYTHON SOURCE LINES 134-152
 
 .. code-block:: default
 
@@ -216,12 +228,12 @@ Compile the model with relay
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 147-149
+.. GENERATED FROM PYTHON SOURCE LINES 153-155
 
 Execute on TVM
 --------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 149-165
+.. GENERATED FROM PYTHON SOURCE LINES 155-171
 
 .. code-block:: default
 
@@ -248,12 +260,12 @@ Execute on TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 166-168
+.. GENERATED FROM PYTHON SOURCE LINES 172-174
 
 Display results
 ---------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 168-193
+.. GENERATED FROM PYTHON SOURCE LINES 174-199
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
index 57bc7920b..284c7d3c7 100644
--- a/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/compile_models/sg_execution_times.rst.txt
@@ -5,26 +5,26 @@
 
 Computation times
 =================
-**06:12.515** total execution time for **how_to_compile_models** files:
+**04:52.426** total execution time for **how_to_compile_models** files:
 
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 01:19.763 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_darknet.py` (``from_darknet.py``)       | 00:59.876 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 01:07.728 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 00:59.187 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tensorflow.py` (``from_tensorflow.py``) | 01:03.691 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_paddle.py` (``from_paddle.py``)         | 00:41.098 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:36.096 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:26.186 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:32.404 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_oneflow.py` (``from_oneflow.py``)       | 00:25.636 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_tflite.py` (``from_tflite.py``)         | 00:25.138 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:23.346 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_mxnet.py` (``from_mxnet.py``)           | 00:23.214 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:21.639 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_coreml.py` (``from_coreml.py``)         | 00:22.080 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:18.841 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_pytorch.py` (``from_pytorch.py``)       | 00:19.901 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_keras.py` (``from_keras.py``)           | 00:14.333 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.500 | 0.0 MB |
+| :ref:`sphx_glr_how_to_compile_models_from_onnx.py` (``from_onnx.py``)             | 00:02.282 | 0.0 MB |
 +-----------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
index 2b8ff799b..56e6ae661 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_android.rst.txt
@@ -26,11 +26,12 @@ Deploy the Pretrained Model on Android
 
 This is an example of using Relay to compile a keras model and deploy it on Android device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 27-41
+.. GENERATED FROM PYTHON SOURCE LINES 27-42
 
 .. code-block:: default
 
 
+
     import os
     import numpy as np
     from PIL import Image
@@ -51,7 +52,7 @@ This is an example of using Relay to compile a keras model and deploy it on Andr
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-84
+.. GENERATED FROM PYTHON SOURCE LINES 48-90
 
 Setup Environment
 -----------------
@@ -96,7 +97,7 @@ After building TVM successfully, Please set PYTHONPATH.
   echo 'export PYTHONPATH=/workspace/python:/workspace/vta/python:${PYTHONPATH}' >> ~/.bashrc
   source ~/.bashrc
 
-.. GENERATED FROM PYTHON SOURCE LINES 86-103
+.. GENERATED FROM PYTHON SOURCE LINES 92-109
 
 Start RPC Tracker
 -----------------
@@ -116,7 +117,7 @@ The expected output is
 
   INFO:RPCTracker:bind to 0.0.0.0:9190
 
-.. GENERATED FROM PYTHON SOURCE LINES 105-186
+.. GENERATED FROM PYTHON SOURCE LINES 111-192
 
 Register Android device to RPC Tracker
 --------------------------------------
@@ -200,13 +201,13 @@ If you use OpenCL and Vulkan, please set :code:`test_opencl` and :code:`test_vul
   python3 tests/android_rpc_test.py
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 188-191
+.. GENERATED FROM PYTHON SOURCE LINES 194-197
 
 Load pretrained keras model
 ---------------------------
 We load a pretrained MobileNetV2(alpha=0.5) classification model provided by keras.
 
-.. GENERATED FROM PYTHON SOURCE LINES 191-206
+.. GENERATED FROM PYTHON SOURCE LINES 197-212
 
 .. code-block:: default
 
@@ -232,12 +233,12 @@ We load a pretrained MobileNetV2(alpha=0.5) classification model provided by ker
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 207-209
+.. GENERATED FROM PYTHON SOURCE LINES 213-215
 
 In order to test our model, here we download an image of a cat and
 transform its format.
 
-.. GENERATED FROM PYTHON SOURCE LINES 209-226
+.. GENERATED FROM PYTHON SOURCE LINES 215-232
 
 .. code-block:: default
 
@@ -265,12 +266,12 @@ transform its format.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 227-229
+.. GENERATED FROM PYTHON SOURCE LINES 233-235
 
 The synset is used to transform the label from an ImageNet class number to
 a word humans can understand.
 
-.. GENERATED FROM PYTHON SOURCE LINES 229-243
+.. GENERATED FROM PYTHON SOURCE LINES 235-249
 
 .. code-block:: default
 
@@ -295,7 +296,7 @@ the word human can understand.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 244-250
+.. GENERATED FROM PYTHON SOURCE LINES 250-256
 
 Compile the model with relay
 ----------------------------
@@ -304,7 +305,7 @@ set it as :code:`llvm`. If running it on the Android device, we need to
 specify its instruction set. Set :code:`local_demo` to False if you want
 to run this tutorial with a real device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 250-286
+.. GENERATED FROM PYTHON SOURCE LINES 256-292
 
 .. code-block:: default
 
@@ -358,14 +359,14 @@ to run this tutorial with a real device.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 287-291
+.. GENERATED FROM PYTHON SOURCE LINES 293-297
 
 Deploy the Model Remotely by RPC
 --------------------------------
 With RPC, you can deploy the model remotely from your host machine
 to the remote android device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 291-319
+.. GENERATED FROM PYTHON SOURCE LINES 297-325
 
 .. code-block:: default
 
@@ -404,12 +405,12 @@ to the remote android device.
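
A minimal sketch of the RPC flow, assuming a tracker is running and the device
is registered as described earlier (the host, port, and key are assumptions):

.. code-block:: python

    from tvm import rpc
    from tvm.contrib import ndk, utils

    tracker = rpc.connect_tracker("0.0.0.0", 9190)
    remote = tracker.request("android", priority=0, session_timeout=60)

    # Cross-compile, upload, and reload the library on the device.
    tmp = utils.tempdir()
    lib_path = tmp.relpath("net.so")
    lib.export_library(lib_path, ndk.create_shared)
    remote.upload(lib_path)
    rlib = remote.load_module("net.so")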
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 320-322
+.. GENERATED FROM PYTHON SOURCE LINES 326-328
 
 Execute on TVM
 --------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 322-337
+.. GENERATED FROM PYTHON SOURCE LINES 328-343
 
 .. code-block:: default
 
@@ -440,13 +441,13 @@ Execute on TVM
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      16.3846      16.2081      16.9364      15.9044       0.4325   
+      15.5144      15.4960      15.6520      15.4397       0.0695   
                
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 338-361
+.. GENERATED FROM PYTHON SOURCE LINES 344-367
 
 Sample Output
 -------------
diff --git a/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt b/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt
index ea87701ff..16e736a37 100644
--- a/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_model_on_rasp.rst.txt
@@ -27,11 +27,12 @@ Deploy the Pretrained Model on Raspberry Pi
 This is an example of using Relay to compile a ResNet model and deploy
 it on Raspberry Pi.
 
-.. GENERATED FROM PYTHON SOURCE LINES 28-36
+.. GENERATED FROM PYTHON SOURCE LINES 28-37
 
 .. code-block:: default
 
 
+
     import tvm
     from tvm import te
     import tvm.relay as relay
@@ -46,7 +47,7 @@ it on Raspberry Pi.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-74
+.. GENERATED FROM PYTHON SOURCE LINES 43-80
 
 .. _build-tvm-runtime-on-device:
 
@@ -86,7 +87,7 @@ directory is in :code:`~/tvm`):
 
 To update the environment variables, execute :code:`source ~/.bashrc`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 76-92
+.. GENERATED FROM PYTHON SOURCE LINES 82-98
 
 Set Up RPC Server on Device
 ---------------------------
@@ -105,7 +106,7 @@ successfully on your device.
      INFO:root:RPCServer: bind to 0.0.0.0:9090
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 94-101
+.. GENERATED FROM PYTHON SOURCE LINES 100-107
 
 Prepare the Pre-trained Model
 -----------------------------
 We will use a pre-trained model from
 `MXNet Gluon model zoo <https://mxnet.apache.org/api/python/gluon/model_zoo.html>`_.
 You can find more details about this part in the tutorial :ref:`tutorial-from-mxnet`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-109
+.. GENERATED FROM PYTHON SOURCE LINES 107-115
 
 .. code-block:: default
 
@@ -134,12 +135,12 @@ You can found more details about this part at tutorial :ref:`tutorial-from-mxnet
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-112
+.. GENERATED FROM PYTHON SOURCE LINES 116-118
 
 In order to test our model, here we download an image of a cat and
 transform its format.
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-128
+.. GENERATED FROM PYTHON SOURCE LINES 118-134
 
 .. code-block:: default
 
@@ -166,12 +167,12 @@ transform its format.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 129-131
+.. GENERATED FROM PYTHON SOURCE LINES 135-137
 
 The synset is used to transform the label from an ImageNet class number to
 a word humans can understand.
 
-.. GENERATED FROM PYTHON SOURCE LINES 131-144
+.. GENERATED FROM PYTHON SOURCE LINES 137-150
 
 .. code-block:: default
 
@@ -195,12 +196,12 @@ the word human can understand.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 145-147
+.. GENERATED FROM PYTHON SOURCE LINES 151-153
 
 Now we would like to port the Gluon model to a portable computational graph.
 It takes only a few lines.
 
-.. GENERATED FROM PYTHON SOURCE LINES 147-155
+.. GENERATED FROM PYTHON SOURCE LINES 153-161
 
 .. code-block:: default
 
@@ -219,11 +220,11 @@ It's as easy as several lines.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 156-157
+.. GENERATED FROM PYTHON SOURCE LINES 162-163
 
 Here are some basic data workload configurations.
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-162
+.. GENERATED FROM PYTHON SOURCE LINES 163-168
 
 .. code-block:: default
 
@@ -239,7 +240,7 @@ Here are some basic data workload configurations.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 163-172
+.. GENERATED FROM PYTHON SOURCE LINES 169-178
 
 Compile The Graph
 -----------------
@@ -251,14 +252,14 @@ apart from arguments :code:`net` and :code:`params` to specify the
 deep learning workload. The option matters: a different option
 will lead to very different performance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 174-178
+.. GENERATED FROM PYTHON SOURCE LINES 180-184
 
 If we run the example on our x86 server for demonstration, we can simply
 set it as :code:`llvm`. If running it on the Raspberry Pi, we need to
 specify its instruction set. Set :code:`local_demo` to False if you want
 to run this tutorial with a real device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 178-200
+.. GENERATED FROM PYTHON SOURCE LINES 184-206
 
 .. code-block:: default
 
@@ -300,14 +301,14 @@ to run this tutorial with a real device.
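
A minimal sketch of the target selection described above (the board preset
``"rasp3b"`` is one example; pick the preset that matches your device):

.. code-block:: python

    import tvm

    local_demo = True  # set to False to cross-compile for a real board

    if local_demo:
        target = tvm.target.Target("llvm")
    else:
        target = tvm.target.arm_cpu("rasp3b")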
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 201-205
+.. GENERATED FROM PYTHON SOURCE LINES 207-211
 
 Deploy the Model Remotely by RPC
 --------------------------------
 With RPC, you can deploy the model remotely from your host machine
 to the remote device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 205-231
+.. GENERATED FROM PYTHON SOURCE LINES 211-237
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
index cf2cd27f5..d394353ae 100644
--- a/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_object_detection_pytorch.rst.txt
@@ -42,11 +42,12 @@ with the proper TorchVision version.
 Currently, TVM supports PyTorch 1.7 and 1.4. Other versions may
 be unstable.
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-56
+.. GENERATED FROM PYTHON SOURCE LINES 42-57
 
 .. code-block:: default
 
 
+
     import tvm
     from tvm import relay
@@ -67,12 +68,12 @@ be unstable.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 57-59
+.. GENERATED FROM PYTHON SOURCE LINES 63-65
 
 Load pre-trained maskrcnn from torchvision and do tracing
 ---------------------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 59-96
+.. GENERATED FROM PYTHON SOURCE LINES 65-102
 
 .. code-block:: default
 
@@ -122,7 +123,7 @@ Load pre-trained maskrcnn from torchvision and do tracing
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth" to /workspace/.cache/torch/hub/checkpoints/maskrcnn_resnet50_fpn_coco-bf2d0c1e.pth
-
      0%|          | 0.00/170M [00:00<?, ?B/s]
      2%|1         | 2.72M/170M [00:00<00:06, 28.4MB/s]
      4%|3         | 5.98M/170M [00:00<00:05, 31.8MB/s]
     15%|#4        | 25.1M/170M [00:00<00:01, 109MB/s] 
     24%|##4       | 41.6M/170M [00:00<00:01, 134MB/s]
     38%|###8      | 65.2M/170M [00:00<00:00, 175MB/s]
     51%|#####1    | 86.7M/170M [00:00<00:00, 192MB/s]
     62%|######1   | 105M/170M [00:00<00:00, 192MB/s] 
     73%|#######3  | 124M/170M [00:00<00:00, 195MB/s]
     85%|########4 | 144M/170M [00:00<00:00, 198MB/s]
     98%|#########7| 166M/170M [00:01<00:00, 207MB/s]
    100%|##########| 170M/170M [00:01<00:00, 174MB/s]
+
      0%|          | 0.00/170M [00:00<?, ?B/s]
      1%|          | 1.19M/170M [00:00<00:14, 12.4MB/s]
      2%|2         | 3.66M/170M [00:00<00:08, 20.2MB/s]
      5%|5         | 8.76M/170M [00:00<00:04, 35.3MB/s]
     12%|#1        | 19.8M/170M [00:00<00:02, 66.9MB/s]
     24%|##4       | 41.3M/170M [00:00<00:01, 124MB/s] 
     35%|###5      | 60.0M/170M [00:00<00:00, 149MB/s]
     45%|####5     | 77.1M/170M [00:00<00:00, 159MB/s]
     56%|#####5    | 94.8M/170M [00:00<00:00, 167MB/s]
     66%|######6   | 113M/170M [00:00<00:00, 173MB/s] 
     77%|#######7  | 131M/170M [00:01<00:00, 178MB/s]
     88%|########7 | 149M/170M [00:01<00:00, 183MB/s]
     99%|#########8| 168M/170M [00:01<00:00, 186MB/s]
    100%|##########| 170M/170M [00:01<00:00, 147MB/s]
     /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3878: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
       for i in range(dim)
     /usr/local/lib/python3.7/dist-packages/torchvision/models/detection/anchor_utils.py:127: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
@@ -145,12 +146,12 @@ Load pre-trained maskrcnn from torchvision and do tracing
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 97-99
+.. GENERATED FROM PYTHON SOURCE LINES 103-105
 
 Download a test image and pre-process
 -------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-110
+.. GENERATED FROM PYTHON SOURCE LINES 105-116
 
 .. code-block:: default
 
@@ -172,12 +173,12 @@ Download a test image and pre-process
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-113
+.. GENERATED FROM PYTHON SOURCE LINES 117-119
 
 Import the graph to Relay
 -------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 113-117
+.. GENERATED FROM PYTHON SOURCE LINES 119-123
 
 .. code-block:: default
 
@@ -199,7 +200,7 @@ Import the graph to Relay
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 118-124
+.. GENERATED FROM PYTHON SOURCE LINES 124-130
 
 Compile with Relay VM
 ---------------------
@@ -208,7 +209,7 @@ highly recommended to build TVM with Intel MKL and Intel OpenMP to get
 best performance, due to the existence of large dense operators in
 torchvision rcnn models.
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-133
+.. GENERATED FROM PYTHON SOURCE LINES 130-139
 
 .. code-block:: default
 
@@ -235,12 +236,12 @@ torchvision rcnn models.
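
A minimal sketch of the VM compilation, assuming ``mod`` and ``params`` come
from the Relay import above (the ``disabled_pass`` option mirrors what such
rcnn models typically need; treat it as an assumption):

.. code-block:: python

    import tvm
    from tvm import relay

    with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
        vm_exec = relay.vm.compile(mod, target="llvm", params=params)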
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 134-136
+.. GENERATED FROM PYTHON SOURCE LINES 140-142
 
 Inference with Relay VM
 -----------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-141
+.. GENERATED FROM PYTHON SOURCE LINES 142-147
 
 .. code-block:: default
 
@@ -256,12 +257,12 @@ Inference with Relay VM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 142-144
+.. GENERATED FROM PYTHON SOURCE LINES 148-150
 
 Get boxes with score larger than 0.9
 ------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 144-154
+.. GENERATED FROM PYTHON SOURCE LINES 150-160
 
 .. code-block:: default
 
@@ -291,7 +292,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 3 minutes  0.821 seconds)
+   **Total running time of the script:** ( 2 minutes  51.237 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
index 87ef3e5d8..de0496e3a 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized.rst.txt
@@ -30,11 +30,24 @@ the quantization story in TVM can be found
 Here, we demonstrate how to load and run models quantized by PyTorch, MXNet, and TFLite.
 Once loaded, we can run compiled, quantized models on any hardware TVM supports.
 
-.. GENERATED FROM PYTHON SOURCE LINES 32-33
+.. GENERATED FROM PYTHON SOURCE LINES 30-32
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 38-39
 
 First, necessary imports
 
-.. GENERATED FROM PYTHON SOURCE LINES 33-45
+.. GENERATED FROM PYTHON SOURCE LINES 39-51
 
 .. code-block:: default
 
@@ -57,11 +70,11 @@ First, necessary imports
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 46-47
+.. GENERATED FROM PYTHON SOURCE LINES 52-53
 
 Helper functions to run the demo
 
-.. GENERATED FROM PYTHON SOURCE LINES 47-100
+.. GENERATED FROM PYTHON SOURCE LINES 53-106
 
 .. code-block:: default
 
@@ -125,12 +138,12 @@ Helper functions to run the demo
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-103
+.. GENERATED FROM PYTHON SOURCE LINES 107-109
 
 A mapping from label to class name, to verify that the outputs from models below
 are reasonable.
 
-.. GENERATED FROM PYTHON SOURCE LINES 103-105
+.. GENERATED FROM PYTHON SOURCE LINES 109-111
 
 .. code-block:: default
 
@@ -143,11 +156,11 @@ are reasonable
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 106-107
+.. GENERATED FROM PYTHON SOURCE LINES 112-113
 
 Everyone's favorite cat image for demonstration
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-109
+.. GENERATED FROM PYTHON SOURCE LINES 113-115
 
 .. code-block:: default
 
@@ -160,7 +173,7 @@ Everyone's favorite cat image for demonstration
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-122
+.. GENERATED FROM PYTHON SOURCE LINES 116-128
 
 Deploy a quantized PyTorch Model
 --------------------------------
@@ -175,7 +188,7 @@ We use this function to quantize PyTorch models.
 In short, this function takes a floating point model and converts it to uint8.
 The model is per-channel quantized.
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-133
+.. GENERATED FROM PYTHON SOURCE LINES 128-139
 
 .. code-block:: default
 
@@ -197,14 +210,14 @@ The model is per-channel quantized.
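
A minimal sketch of such a helper, using PyTorch's post-training static
quantization with the x86 ``fbgemm`` backend. The ``fuse_model()`` call is
available on torchvision's quantizable models, and calibrating on a single
batch is a simplification:

.. code-block:: python

    import torch

    def quantize_model(model, inp):
        model.fuse_model()
        model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
        torch.quantization.prepare(model, inplace=True)
        model(inp)  # calibrate on one batch
        torch.quantization.convert(model, inplace=True)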
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 134-138
+.. GENERATED FROM PYTHON SOURCE LINES 140-144
 
 Load quantization-ready, pretrained Mobilenet v2 model from torchvision
 -----------------------------------------------------------------------
 We choose mobilenet v2 because this model was trained with quantization-aware
 training. Other models require full post-training calibration.
 
-.. GENERATED FROM PYTHON SOURCE LINES 138-140
+.. GENERATED FROM PYTHON SOURCE LINES 144-146
 
 .. code-block:: default
 
@@ -219,19 +232,19 @@ training. Other models require a full post training calibration.
  .. code-block:: none
 
     Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /workspace/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
-
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
     99%|#########8| 13.4M/13.6M [00:00<00:00, 140MB/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 140MB/s]
+
      0%|          | 0.00/13.6M [00:00<?, ?B/s]
     13%|#2        | 1.75M/13.6M [00:00<00:00, 17.6MB/s]
     31%|###1      | 4.25M/13.6M [00:00<00:00, 22.3MB/s]
     52%|#####1    | 7.01M/13.6M [00:00<00:00, 24.6MB/s]
     69%|######9   | 9.36M/13.6M [00:00<00:00, 21.7MB/s]
     89%|########8 | 12.0M/13.6M [00:00<00:00, 22.7MB/s]
    100%|##########| 13.6M/13.6M [00:00<00:00, 22.0MB/s]
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 141-145
+.. GENERATED FROM PYTHON SOURCE LINES 147-151
 
 Quantize, trace and run the PyTorch Mobilenet v2 model
 ------------------------------------------------------
 The details are out of scope for this tutorial. Please refer to the tutorials
 on the PyTorch website to learn about quantization and jit.
 
-.. GENERATED FROM PYTHON SOURCE LINES 145-152
+.. GENERATED FROM PYTHON SOURCE LINES 151-158
 
 .. code-block:: default
 
@@ -258,7 +271,7 @@ on the PyTorch website to learn about quantization and jit.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 153-164
+.. GENERATED FROM PYTHON SOURCE LINES 159-170
 
 Convert quantized Mobilenet v2 to Relay-QNN using the PyTorch frontend
 ----------------------------------------------------------------------
@@ -272,7 +285,7 @@ represented.
 You will see operators specific to quantization, such as
 qnn.quantize, qnn.dequantize, qnn.requantize, and qnn.conv2d.
 
-.. GENERATED FROM PYTHON SOURCE LINES 164-169
+.. GENERATED FROM PYTHON SOURCE LINES 170-175
 
 .. code-block:: default
 
@@ -288,7 +301,7 @@ qnn.quantize, qnn.dequantize, qnn.requantize, and qnn.conv2d etc.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 170-178
+.. GENERATED FROM PYTHON SOURCE LINES 176-184
 
 Compile and run the Relay module
 --------------------------------
@@ -299,7 +312,7 @@ tutorials for more details.
 Under the hood, quantization specific operators are lowered to a sequence of
 standard Relay operators before compilation.
 
-.. GENERATED FROM PYTHON SOURCE LINES 178-181
+.. GENERATED FROM PYTHON SOURCE LINES 184-187
 
 .. code-block:: default
 
@@ -320,13 +333,13 @@ standard Relay operators before compilation.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 182-185
+.. GENERATED FROM PYTHON SOURCE LINES 188-191
 
 Compare the output labels
 -------------------------
 We should see identical labels printed.
 
-.. GENERATED FROM PYTHON SOURCE LINES 185-191
+.. GENERATED FROM PYTHON SOURCE LINES 191-197
 
 .. code-block:: default
 
@@ -350,13 +363,13 @@ We should see identical labels printed.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 192-195
+.. GENERATED FROM PYTHON SOURCE LINES 198-201
 
 However, due to differences in numerics, the raw floating point
 outputs are in general not expected to be identical. Here, we print how many floating point
 output values are identical out of 1000 outputs from mobilenet v2.
 
-.. GENERATED FROM PYTHON SOURCE LINES 195-197
+.. GENERATED FROM PYTHON SOURCE LINES 201-203
 
 .. code-block:: default
 
@@ -375,13 +388,13 @@ output values are identical out of 1000 outputs from mobilenet v2.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 198-201
+.. GENERATED FROM PYTHON SOURCE LINES 204-207
 
 Measure performance
 -------------------------
 Here we give an example of how to measure performance of TVM compiled models.
 
-.. GENERATED FROM PYTHON SOURCE LINES 201-205
+.. GENERATED FROM PYTHON SOURCE LINES 207-211
 
 .. code-block:: default
 
@@ -399,13 +412,13 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      90.5098      90.4704      92.0010      90.2355       0.2184   
+      90.2668      90.2334      90.6747      90.0616       0.1463   
                
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 206-213
+.. GENERATED FROM PYTHON SOURCE LINES 212-219
 
 .. note::
 
@@ -415,7 +428,7 @@ Here we give an example of how to measure performance of TVM compiled models.
    * It includes several warm up runs
    * The same method can be used to profile on remote devices (android etc.).
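
A minimal sketch of such a measurement, assuming ``rt_mod`` is the graph
executor module and ``dev`` the device used above (names are illustrative):

.. code-block:: python

    import numpy as np

    ftimer = rt_mod.module.time_evaluator("run", dev, number=100)
    prof_res = np.array(ftimer().results) * 1e3  # convert to milliseconds
    print("mean: %.2f ms, std: %.2f ms" % (np.mean(prof_res), np.std(prof_res)))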
 
-.. GENERATED FROM PYTHON SOURCE LINES 216-231
+.. GENERATED FROM PYTHON SOURCE LINES 222-237
 
 .. note::
 
@@ -433,13 +446,13 @@ Here we give an example of how to measure performance of TVM compiled models.
    * Choose the best target for your hardware, such as "llvm -mcpu=skylake-avx512" or
      "llvm -mcpu=cascadelake" (more CPUs with AVX512 would come in the future)
 
-.. GENERATED FROM PYTHON SOURCE LINES 234-237
+.. GENERATED FROM PYTHON SOURCE LINES 240-243
 
 Deploy a quantized MXNet Model
 ------------------------------
 TODO
 
-.. GENERATED FROM PYTHON SOURCE LINES 239-242
+.. GENERATED FROM PYTHON SOURCE LINES 245-248
 
 Deploy a quantized TFLite Model
 -------------------------------
@@ -448,7 +461,7 @@ TODO
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  8.576 seconds)
+   **Total running time of the script:** ( 1 minutes  6.022 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
index 0c7d64ff3..1f7773185 100644
--- a/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_prequantized_tflite.rst.txt
@@ -43,12 +43,25 @@ To get started, Tensorflow and TFLite package needs to be installed as prerequis
 
 Now please check that the TFLite package is installed successfully, ``python -c "import tflite"``
 
-.. GENERATED FROM PYTHON SOURCE LINES 46-48
+.. GENERATED FROM PYTHON SOURCE LINES 44-46
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 52-54
 
 Necessary imports
 -----------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-57
+.. GENERATED FROM PYTHON SOURCE LINES 54-63
 
 .. code-block:: default
 
@@ -68,12 +81,12 @@ Necessary imports
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 58-60
+.. GENERATED FROM PYTHON SOURCE LINES 64-66
 
 Download pretrained Quantized TFLite model
 ------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 60-76
+.. GENERATED FROM PYTHON SOURCE LINES 66-82
 
 .. code-block:: default
 
@@ -100,12 +113,12 @@ Download pretrained Quantized TFLite model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 77-79
+.. GENERATED FROM PYTHON SOURCE LINES 83-85
 
 Utils for downloading and extracting zip files
 ----------------------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 79-94
+.. GENERATED FROM PYTHON SOURCE LINES 85-100
 
 .. code-block:: default
 
@@ -131,17 +144,17 @@ Utils for downloading and extracting zip files
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 95-97
+.. GENERATED FROM PYTHON SOURCE LINES 101-103
 
 Load a test image
 -----------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-101
+.. GENERATED FROM PYTHON SOURCE LINES 105-107
 
 Get a real image for e2e testing
 --------------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-116
+.. GENERATED FROM PYTHON SOURCE LINES 107-122
 
 .. code-block:: default
 
@@ -167,16 +180,16 @@ Get a real image for e2e testing
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-119
+.. GENERATED FROM PYTHON SOURCE LINES 123-125
 
 Load a tflite model
 -------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 121-122
+.. GENERATED FROM PYTHON SOURCE LINES 127-128
 
 Now we can open mobilenet_v2_1.0_224.tflite
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-135
+.. GENERATED FROM PYTHON SOURCE LINES 128-141
 
 .. code-block:: default
 
@@ -200,11 +213,11 @@ Now we can open mobilenet_v2_1.0_224.tflite
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-137
+.. GENERATED FROM PYTHON SOURCE LINES 142-143
 
 Let's run TFLite pre-quantized model inference and get the TFLite prediction.
 
-.. GENERATED FROM PYTHON SOURCE LINES 137-168
+.. GENERATED FROM PYTHON SOURCE LINES 143-174
 
 .. code-block:: default
 
@@ -246,11 +259,11 @@ Lets run TFLite pre-quantized model inference and get the TFLite prediction.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 169-170
+.. GENERATED FROM PYTHON SOURCE LINES 175-176
 
 Let's run TVM-compiled pre-quantized model inference and get the TVM prediction.
 
-.. GENERATED FROM PYTHON SOURCE LINES 170-181
+.. GENERATED FROM PYTHON SOURCE LINES 176-187
 
 .. code-block:: default
 
@@ -272,16 +285,16 @@ Lets run TVM compiled pre-quantized model inference and get the TVM prediction.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 182-184
+.. GENERATED FROM PYTHON SOURCE LINES 188-190
 
 TFLite inference
 ----------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 186-187
+.. GENERATED FROM PYTHON SOURCE LINES 192-193
 
 Run TFLite inference on the quantized model.
 
-.. GENERATED FROM PYTHON SOURCE LINES 187-190
+.. GENERATED FROM PYTHON SOURCE LINES 193-196
 
 .. code-block:: default
 
@@ -295,19 +308,19 @@ Run TFLite inference on the quantized model.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 191-193
+.. GENERATED FROM PYTHON SOURCE LINES 197-199
 
 TVM compilation and inference
 -----------------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 195-199
+.. GENERATED FROM PYTHON SOURCE LINES 201-205
 
 We use the TFLite-Relay parser to convert the TFLite pre-quantized graph into Relay IR. Note that
 the frontend parser call for a pre-quantized model is exactly the same as the call for an FP32
 model. We encourage you to remove the comment from print(mod) and inspect the Relay module. You
 will see many QNN operators, like Requantize, Quantize and QNN Conv2D.
 
-.. GENERATED FROM PYTHON SOURCE LINES 199-205
+.. GENERATED FROM PYTHON SOURCE LINES 205-211
 
 .. code-block:: default
 
@@ -324,12 +337,12 @@ will see many QNN operators, like, Requantize, Quantize and QNN Conv2D.
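
A minimal sketch of the parser call; the input name, shape, and dtype are
assumptions for a pre-quantized MobileNet, so check your model's input tensor:

.. code-block:: python

    from tvm import relay

    mod, params = relay.frontend.from_tflite(
        tflite_model,
        shape_dict={"input": (1, 224, 224, 3)},
        dtype_dict={"input": "uint8"},  # quantized models often take uint8
    )
    # print(mod)  # uncomment to see the QNN operators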
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 206-208
+.. GENERATED FROM PYTHON SOURCE LINES 212-214
 
 Let's now compile the Relay module. We use the "llvm" target here. Please replace it with the
 target platform that you are interested in.
 
-.. GENERATED FROM PYTHON SOURCE LINES 208-212
+.. GENERATED FROM PYTHON SOURCE LINES 214-218
 
 .. code-block:: default
 
@@ -351,11 +364,11 @@ target platform that you are interested in.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 213-214
+.. GENERATED FROM PYTHON SOURCE LINES 219-220
 
 Finally, let's call inference on the TVM-compiled module.
 
-.. GENERATED FROM PYTHON SOURCE LINES 214-216
+.. GENERATED FROM PYTHON SOURCE LINES 220-222
 
 .. code-block:: default
 
@@ -368,18 +381,18 @@ Finally, lets call inference on the TVM compiled module.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 217-219
+.. GENERATED FROM PYTHON SOURCE LINES 223-225
 
 Accuracy comparison
 -------------------
 
-.. GENERATED FROM PYTHON SOURCE LINES 221-224
+.. GENERATED FROM PYTHON SOURCE LINES 227-230
 
 Print the top-5 labels for TFLite and TVM inference.
 We check the labels because the requantize implementation differs between
 TFLite and Relay, which causes the final output numbers to mismatch. So, we test accuracy via labels.
 
-.. GENERATED FROM PYTHON SOURCE LINES 224-229
+.. GENERATED FROM PYTHON SOURCE LINES 230-235
 
 .. code-block:: default
 
@@ -402,13 +415,13 @@ TFLite and Relay. This cause final output numbers to mismatch. So, testing accur
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 230-233
+.. GENERATED FROM PYTHON SOURCE LINES 236-239
 
 Measure performance
 -------------------
 Here we give an example of how to measure performance of TVM compiled models.
 
-.. GENERATED FROM PYTHON SOURCE LINES 233-237
+.. GENERATED FROM PYTHON SOURCE LINES 239-243
 
 .. code-block:: default
 
@@ -426,13 +439,13 @@ Here we give an example of how to measure performance of TVM compiled models.
 
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      120.9356     120.8775     122.7959     119.9645      0.4537   
+      119.8513     119.8748     120.4627     119.1539      0.2551   
                
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 238-263
+.. GENERATED FROM PYTHON SOURCE LINES 244-269
 
 .. note::
 
@@ -463,7 +476,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  58.713 seconds)
+   **Total running time of the script:** ( 2 minutes  1.683 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
index d663c33d4..3fac64cbd 100644
--- a/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_quantized.rst.txt
@@ -29,11 +29,12 @@ the quantization story in TVM can be found
 In this tutorial, we will import a GluonCV pre-trained model on ImageNet to
 Relay, quantize the Relay model and then perform the inference.
 
-.. GENERATED FROM PYTHON SOURCE LINES 29-44
+.. GENERATED FROM PYTHON SOURCE LINES 29-45
 
 .. code-block:: default
 
 
+
     import tvm
     from tvm import te
     from tvm import relay
@@ -55,14 +56,14 @@ Relay, quantize the Relay model and then perform the inference.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 45-49
+.. GENERATED FROM PYTHON SOURCE LINES 51-55
 
 Prepare the Dataset
 -------------------
 We will demonstrate how to prepare the calibration dataset for quantization.
 We first download the validation set of ImageNet and pre-process the dataset.
 
-.. GENERATED FROM PYTHON SOURCE LINES 49-80
+.. GENERATED FROM PYTHON SOURCE LINES 55-86
 
 .. code-block:: default
 
@@ -104,13 +105,13 @@ We first download the validation set of ImageNet and pre-process the dataset.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 81-84
+.. GENERATED FROM PYTHON SOURCE LINES 87-90
 
 The calibration dataset should be an iterable object. We define the
 calibration dataset as a generator object in Python. In this tutorial, we
 only use a few samples for calibration.
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-98
+.. GENERATED FROM PYTHON SOURCE LINES 90-104
 
 .. code-block:: default
 
@@ -135,13 +136,13 @@ only use a few samples for calibration.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-102
+.. GENERATED FROM PYTHON SOURCE LINES 105-108
 
 Import the model
 ----------------
 We use the Relay MxNet frontend to import a model from the Gluon model zoo.
 
-.. GENERATED FROM PYTHON SOURCE LINES 102-110
+.. GENERATED FROM PYTHON SOURCE LINES 108-116
 
 .. code-block:: default
 
@@ -160,7 +161,7 @@ We use the Relay MxNet frontend to import a model from the Gluon model zoo.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-132
+.. GENERATED FROM PYTHON SOURCE LINES 117-138
 
 Quantize the Model
 ------------------
@@ -184,7 +185,7 @@ distribution of activation before and after quantization.
 Alternatively, we can also use pre-defined global scales. This saves the time
 needed for calibration, but the accuracy might be impacted.
 
-.. GENERATED FROM PYTHON SOURCE LINES 132-144
+.. GENERATED FROM PYTHON SOURCE LINES 138-150
 
 .. code-block:: default
 
@@ -207,13 +208,13 @@ for calibration. But the accuracy might be impacted.
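
A minimal sketch of both modes described above, assuming ``calibrate_dataset``
is the generator defined earlier:

.. code-block:: python

    from tvm import relay

    def quantize(mod, params, data_aware):
        if data_aware:
            # Calibrate scales with KL-divergence over the dataset generator.
            with relay.quantize.qconfig(calibrate_mode="kl_divergence",
                                        weight_scale="max"):
                mod = relay.quantize.quantize(mod, params,
                                              dataset=calibrate_dataset())
        else:
            # Pre-defined global scale: faster, but accuracy may suffer.
            with relay.quantize.qconfig(calibrate_mode="global_scale",
                                        global_scale=8.0):
                mod = relay.quantize.quantize(mod, params)
        return mod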
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 145-148
+.. GENERATED FROM PYTHON SOURCE LINES 151-154
 
 Run Inference
 -------------
 We create a Relay VM to build and execute the model.
 
-.. GENERATED FROM PYTHON SOURCE LINES 148-166
+.. GENERATED FROM PYTHON SOURCE LINES 154-172
 
 .. code-block:: default
 
@@ -254,7 +255,7 @@ We create a Relay VM to build and execute the model.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 3 minutes  34.381 seconds)
+   **Total running time of the script:** ( 1 minutes  33.733 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_quantized.py:
diff --git a/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt b/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt
index d59ff0e72..15c8af0e0 100644
--- a/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_sparse.rst.txt
@@ -72,14 +72,27 @@ When generating random sparse weights for an unpruned model, we do so with struc
 sparsity. A fun exercise is comparing the real speed of PruneBert with the block
 sparse speed using fake weights to see the benefit of structured sparsity.
 
-.. GENERATED FROM PYTHON SOURCE LINES 74-78
+.. GENERATED FROM PYTHON SOURCE LINES 72-74
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 80-84
 
 Load Required Modules
 ---------------------
 Other than TVM, scipy, the latest transformers, and
 tensorflow 2.2+ are required.
 
-.. GENERATED FROM PYTHON SOURCE LINES 78-107
+.. GENERATED FROM PYTHON SOURCE LINES 84-113
 
 .. code-block:: default
 
@@ -119,14 +132,14 @@ tensorflow 2.2+ are required.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 108-112
+.. GENERATED FROM PYTHON SOURCE LINES 114-118
 
 Configure Settings
 ------------------
 Let's start by defining some parameters for the type of model
 and sparsity to run.
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-136
+.. GENERATED FROM PYTHON SOURCE LINES 118-142
 
 .. code-block:: default
 
@@ -161,7 +174,7 @@ and sparsity to run.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 137-142
+.. GENERATED FROM PYTHON SOURCE LINES 143-148
 
 Download and Convert Transformers Model
 ---------------------------------------
@@ -169,7 +182,7 @@ Now we'll grab a model from the transformers module, download it,
 convert it into a TensorFlow graphdef in preparation for converting that graphdef into
 a relay graph that we can optimize and deploy.
 
-.. GENERATED FROM PYTHON SOURCE LINES 142-178
+.. GENERATED FROM PYTHON SOURCE LINES 148-184
 
 .. code-block:: default
 
@@ -216,7 +229,7 @@ a relay graph that we can optimize and deploy.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 179-185
+.. GENERATED FROM PYTHON SOURCE LINES 185-191
 
 Convert to Relay Graph
 ----------------------
@@ -225,7 +238,7 @@ for relay conversion. Let's import it! In the following function we
 save the imported graph in relay's json format so that we don't have
 to reimport from tensorflow each time this script is run.
 
-.. GENERATED FROM PYTHON SOURCE LINES 185-218
+.. GENERATED FROM PYTHON SOURCE LINES 191-224
 
 .. code-block:: default
 
@@ -269,7 +282,7 @@ to reimport from tensorflow each time this script is run.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 219-225
+.. GENERATED FROM PYTHON SOURCE LINES 225-231
 
 Run the Dense Graph
 -------------------
@@ -278,7 +291,7 @@ the weights are sparse, we won't see any speedup because we are using
 regular dense matrix multiplications on these dense (but mostly zero)
 tensors instead of sparse-aware kernels.
 
-.. GENERATED FROM PYTHON SOURCE LINES 225-245
+.. GENERATED FROM PYTHON SOURCE LINES 231-251
 
 .. code-block:: default
 
@@ -309,7 +322,7 @@ tensors instead of sparse aware kernels.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 246-267
+.. GENERATED FROM PYTHON SOURCE LINES 252-273
 
 Run the Sparse Graph
 --------------------
@@ -333,7 +346,7 @@ the rest of the tensor. Once the sparse weights are in BSR format,
 TVM replaces `relay.dense` operations with `relay.sparse_dense` calls that can be
 run faster.
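 To make the BSR idea concrete, here is a small scipy-only sketch; the block size
 ``(16, 1)`` is an illustrative choice, not the tutorial's tuned value:

 .. code-block:: python

     import numpy as np
     import scipy.sparse as sp

     # A dense weight matrix that is mostly zero.
     w = np.zeros((768, 768), dtype="float32")
     w[:, :64] = np.random.rand(768, 64).astype("float32")

     # Convert to Block Sparse Row format: only nonzero blocks are stored.
     w_bsr = sp.bsr_matrix(w, blocksize=(16, 1))
     print(w_bsr.data.shape)  # (num_nonzero_blocks, 16, 1)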
 
-.. GENERATED FROM PYTHON SOURCE LINES 267-316
+.. GENERATED FROM PYTHON SOURCE LINES 273-322
 
 .. code-block:: default
 
@@ -393,7 +406,7 @@ run faster.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 317-322
+.. GENERATED FROM PYTHON SOURCE LINES 323-328
 
 Run All the Code!
 -----------------
 And that's it! Now we'll simply call all the needed functions to benchmark
 the model according to the set parameters. Note that to run this code
 you'll need to uncomment the last line first.
 
-.. GENERATED FROM PYTHON SOURCE LINES 322-332
+.. GENERATED FROM PYTHON SOURCE LINES 328-338
 
 .. code-block:: default
 
@@ -422,14 +435,14 @@ you'll need to uncomment the last line first.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 333-337
+.. GENERATED FROM PYTHON SOURCE LINES 339-343
 
 Sample Output
 -------------
 For reference, below is the output of the script when run on an AMD CPU
 and shows about a 2.5X speedup from using sparsity.
 
-.. GENERATED FROM PYTHON SOURCE LINES 337-363
+.. GENERATED FROM PYTHON SOURCE LINES 343-369
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
index b2278a83a..5f0271270 100644
--- a/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/how_to/deploy_models/deploy_ssd_gluoncv.rst.txt
@@ -26,10 +26,11 @@ Deploy Single Shot Multibox Detector(SSD) model
 This article is an introductory tutorial to deploy SSD models with TVM.
 We will use GluonCV pre-trained SSD model and convert it to Relay IR
 
-.. GENERATED FROM PYTHON SOURCE LINES 26-36
+.. GENERATED FROM PYTHON SOURCE LINES 26-37
 
 .. code-block:: default
 
+
     import tvm
     from tvm import te
 
@@ -54,7 +55,7 @@ We will use GluonCV pre-trained SSD model and convert it to Relay IR
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-59
+.. GENERATED FROM PYTHON SOURCE LINES 43-65
 
 Preliminary and Set parameters
 ------------------------------
@@ -79,7 +80,7 @@ Preliminary and Set parameters
   :code:`opencl` followed by device argument according
   to your device.
 
-.. GENERATED FROM PYTHON SOURCE LINES 59-72
+.. GENERATED FROM PYTHON SOURCE LINES 65-78
 
 .. code-block:: default
 
@@ -103,11 +104,11 @@ Preliminary and Set parameters
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 73-74
+.. GENERATED FROM PYTHON SOURCE LINES 79-80
 
 Download and pre-process demo image
 
-.. GENERATED FROM PYTHON SOURCE LINES 74-82
+.. GENERATED FROM PYTHON SOURCE LINES 80-88
 
 .. code-block:: default
 
@@ -126,11 +127,11 @@ Download and pre-process demo image
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 83-84
+.. GENERATED FROM PYTHON SOURCE LINES 89-90
 
 Convert and compile model for CPU.
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-95
+.. GENERATED FROM PYTHON SOURCE LINES 90-101
 
 .. code-block:: default
 
@@ -157,12 +158,12 @@ Convert and compile model for CPU.
             data: None
       input_sym_arg_type = in_param.infer_type()[0]
     Downloading /workspace/.mxnet/models/ssd_512_resnet50_v1_voc-9c8b225a.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/ssd_512_resnet50_v1_voc-9c8b225a.zip...
-
      0%|          | 0/132723 [00:00<?, ?KB/s]
      5%|5         | 7075/132723 [00:00<00:01, 70742.48KB/s]
     12%|#1        | 15812/132723 [00:00<00:01, 80518.59KB/s]
     18%|#7        | 23864/132723 [00:00<00:01, 73262.13KB/s]
     25%|##4       | 32663/132723 [00:00<00:01, 78745.49KB/s]
     31%|###1      | 41369/132723 [00:00<00:01, 81628.33KB/s]
     37%|###7      | 49592/132723 [00:00<00:01, 77332.87KB/s]
     44%|####3     | 58338/132723 [00:00<00:00, 80462.88KB/s]
     51%|#####     | 67078/132723 [00:00<00:00, 82586.23KB/s]
     57%|#####7    | 75776/132723 [00:00<00:00, 83921.55KB/s]
     63%|######3   | 84263/132723 [00:01<00:00, 84206.61KB/s]
     70%|#######   | 93031/132723 [00:01<00:00, 85253.31KB/s]
     77%|#######6  | 101578/132723 [00:01<00:00, 64041.57KB/s]
     83%|########3 | 110299/132723 [00:01<00:00, 69699.68KB/s]
     89%|########8 | 117948/132723 [00:01<00:00, 67516.40KB/s]
     96%|#########5| 126763/132723 [00:01<00:00, 72855.28KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 76215.88KB/s]
+
      0%|          | 0/132723 [00:00<?, ?KB/s]
      4%|3         | 5104/132723 [00:00<00:02, 51034.45KB/s]
     10%|9         | 12703/132723 [00:00<00:01, 65708.77KB/s]
     15%|#5        | 20287/132723 [00:00<00:01, 70332.11KB/s]
     21%|##        | 27750/132723 [00:00<00:01, 72019.92KB/s]
     27%|##6       | 35322/132723 [00:00<00:01, 73350.66KB/s]
     32%|###2      | 42971/132723 [00:00<00:01, 74412.10KB/s]
     38%|###8      | 50627/132723 [00:00<00:01, 75111.25KB/s]
     44%|####3     | 58280/132723 [00:00<00:00, 75557.78KB/s]
     50%|####9     | 65879/132723 [00:00<00:00, 75691.44KB/s]
     55%|#####5    | 73505/132723 [00:01<00:00, 75865.41KB/s]
     61%|######1   | 81203/132723 [00:01<00:00, 76205.13KB/s]
     67%|######6   | 88895/132723 [00:01<00:00, 76420.28KB/s]
     73%|#######2  | 96593/132723 [00:01<00:00, 76586.64KB/s]
     79%|#######8  | 104252/132723 [00:01<00:00, 76284.08KB/s]
     84%|########4 | 111935/132723 [00:01<00:00, 76443.97KB/s]
     90%|######### | 119599/132723 [00:01<00:00, 76501.67KB/s]
     96%|#########5| 127250/132723 [00:01<00:00, 76395.85KB/s]
    100%|##########| 132723/132723 [00:01<00:00, 74818.26KB/s]
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 96-101
+.. GENERATED FROM PYTHON SOURCE LINES 102-107
 
 Create TVM runtime and do inference
 .. note::
@@ -170,7 +171,7 @@ Create TVM runtime and do inference
   Use target = "cuda -libs" to enable thrust-based sort, if you
   enabled thrust during cmake with -DUSE_THRUST=ON.
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-121
+.. GENERATED FROM PYTHON SOURCE LINES 107-127
 
 .. code-block:: default
 
@@ -208,11 +209,11 @@ Create TVM runtime and do inference
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-123
+.. GENERATED FROM PYTHON SOURCE LINES 128-129
 
 Display result
 
-.. GENERATED FROM PYTHON SOURCE LINES 123-132
+.. GENERATED FROM PYTHON SOURCE LINES 129-138
 
 .. code-block:: default
 
@@ -240,7 +241,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  27.956 seconds)
+   **Total running time of the script:** ( 2 minutes  21.667 seconds)
 
 
 .. _sphx_glr_download_how_to_deploy_models_deploy_ssd_gluoncv.py:
diff --git a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
index f0939897d..c46e9681c 100644
--- a/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/deploy_models/sg_execution_times.rst.txt
@@ -5,22 +5,22 @@
 
 Computation times
 =================
-**13:02.481** total execution time for **how_to_deploy_models** files:
+**10:44.960** total execution time for **how_to_deploy_models** files:
 
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 03:34.381 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 02:51.237 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_object_detection_pytorch.py` (``deploy_object_detection_pytorch.py``) | 03:00.821 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 02:21.667 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_ssd_gluoncv.py` (``deploy_ssd_gluoncv.py``)                           | 02:27.956 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 02:01.683 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized_tflite.py` (``deploy_prequantized_tflite.py``)           | 01:58.713 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_quantized.py` (``deploy_quantized.py``)                               | 01:33.733 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:08.576 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_prequantized.py` (``deploy_prequantized.py``)                         | 01:06.022 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:29.297 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_android.py` (``deploy_model_on_android.py``)                 | 00:28.367 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:22.732 | 0.0 MB |
+| :ref:`sphx_glr_how_to_deploy_models_deploy_model_on_rasp.py` (``deploy_model_on_rasp.py``)                       | 00:22.246 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_deploy_models_deploy_sparse.py` (``deploy_sparse.py``)                                     | 00:00.006 | 0.0 MB |
 +------------------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
index e0036a0ba..24786948a 100644
--- a/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/bring_your_own_datatypes.rst.txt
@@ -54,14 +54,27 @@ If you would like to try this with your own datatype library, first bring the li
 
     ctypes.CDLL('my-datatype-lib.so', ctypes.RTLD_GLOBAL)
 
-.. GENERATED FROM PYTHON SOURCE LINES 56-60
+.. GENERATED FROM PYTHON SOURCE LINES 54-56
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 62-66
 
 A Simple TVM Program
 --------------------
 
 We'll begin by writing a simple program in TVM; afterwards, we will re-write it to use custom datatypes.
 
-.. GENERATED FROM PYTHON SOURCE LINES 60-70
+.. GENERATED FROM PYTHON SOURCE LINES 66-76
 
 .. code-block:: default
 
@@ -82,11 +95,11 @@ We'll begin by writing a simple program in TVM; afterwards, we will re-write it
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 71-72
+.. GENERATED FROM PYTHON SOURCE LINES 77-78
 
 Now, we create random inputs to feed into this program using numpy:
 
-.. GENERATED FROM PYTHON SOURCE LINES 72-82
+.. GENERATED FROM PYTHON SOURCE LINES 78-88
 
 .. code-block:: default
 
@@ -114,11 +127,11 @@ Now, we create random inputs to feed into this program using numpy:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 83-84
+.. GENERATED FROM PYTHON SOURCE LINES 89-90
 
 Finally, we're ready to run the program:
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-88
+.. GENERATED FROM PYTHON SOURCE LINES 90-94
 
 .. code-block:: default
 
@@ -141,7 +154,7 @@ Finally, we're ready to run the program:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 89-97
+.. GENERATED FROM PYTHON SOURCE LINES 95-103
 
 Adding Custom Datatypes
 -----------------------
@@ -152,7 +165,7 @@ We use the same input variables ``x`` and ``y`` as above, but before adding ``x
 Note how we specify the custom datatype: we indicate it using the special ``custom[...]`` syntax.
 Additionally, note the "32" after the datatype: this is the bitwidth of the custom datatype. This tells TVM that each instance of ``myfloat`` is 32 bits wide.
 
-.. GENERATED FROM PYTHON SOURCE LINES 97-108
+.. GENERATED FROM PYTHON SOURCE LINES 103-114
 
 .. code-block:: default
 
@@ -174,13 +187,13 @@ Additionally, note the "32" after the datatype: this is the bitwidth of the cust
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 109-112
+.. GENERATED FROM PYTHON SOURCE LINES 115-118
 
 Trying to generate this program throws an error from TVM.
 TVM does not know how to handle any custom datatype out of the box!
 We first have to register the custom type with TVM, giving it a name and a type code:
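 A minimal sketch of the registration (the name ``myfloat`` and type code 150
 follow this tutorial):

 .. code-block:: python

     import tvm
     from tvm import relay

     # Register the custom type under a name and a user-chosen type code.
     tvm.target.datatype.register("myfloat", 150)

     # The type can now be spelled with the custom[...] syntax; the trailing
     # 32 is the bitwidth of each myfloat value.
     x = relay.var("x", shape=(3,), dtype="float32")
     x_myfloat = relay.cast(x, dtype="custom[myfloat]32")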
 
-.. GENERATED FROM PYTHON SOURCE LINES 112-115
+.. GENERATED FROM PYTHON SOURCE LINES 118-121
 
 .. code-block:: default
 
@@ -194,13 +207,13 @@ We first have to register the custom type with TVM, giving it a name and a type
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 116-119
+.. GENERATED FROM PYTHON SOURCE LINES 122-125
 
 Note that the type code, 150, is currently chosen manually by the user.
 See ``TVMTypeCode::kCustomBegin`` in `include/tvm/runtime/c_runtime_api.h <https://github.com/apache/tvm/blob/main/include/tvm/runtime/data_type.h>`_.
 Now we can generate our program again:
 
-.. GENERATED FROM PYTHON SOURCE LINES 119-128
+.. GENERATED FROM PYTHON SOURCE LINES 125-134
 
 .. code-block:: default
 
@@ -220,11 +233,11 @@ Now we can generate our program again:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 129-130
+.. GENERATED FROM PYTHON SOURCE LINES 135-136
 
 Now we have a Relay program that uses myfloat!
 
-.. GENERATED FROM PYTHON SOURCE LINES 130-132
+.. GENERATED FROM PYTHON SOURCE LINES 136-138
 
 .. code-block:: default
 
@@ -248,11 +261,11 @@ Now we have a Relay program that uses myfloat!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 133-134
+.. GENERATED FROM PYTHON SOURCE LINES 139-140
 
 Now that we can express our program without errors, let's try running it!
 
-.. GENERATED FROM PYTHON SOURCE LINES 134-142
+.. GENERATED FROM PYTHON SOURCE LINES 140-148
 
 .. code-block:: default
 
@@ -277,7 +290,7 @@ Now that we can express our program without errors, let's try running it!
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 143-152
+.. GENERATED FROM PYTHON SOURCE LINES 149-158
 
 Now, trying to compile this program throws an error.
 Let's dissect this error.
@@ -289,7 +302,7 @@ We have not told TVM how to lower ``Cast`` operations for our custom datatypes;
 
 To fix this error, we simply need to specify a lowering function:
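 For example, ``create_lower_func`` can map a 32-bit-to-32-bit cast onto an
 external function; the symbol name ``FloatToMyFloat32`` is a hypothetical
 function exported by your datatype library:

 .. code-block:: python

     tvm.target.datatype.register_op(
         tvm.target.datatype.create_lower_func(
             # (src bit length, dest bit length) -> extern symbol to call
             {(32, 32): "FloatToMyFloat32"}
         ),
         "Cast",     # the operation being lowered
         "llvm",     # the target
         "float",    # source type
         "myfloat",  # destination type
     )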
 
-.. GENERATED FROM PYTHON SOURCE LINES 152-165
+.. GENERATED FROM PYTHON SOURCE LINES 158-171
 
 .. code-block:: default
 
@@ -313,7 +326,7 @@ To fix this error, we simply need to specify a lowering function:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 166-177
+.. GENERATED FROM PYTHON SOURCE LINES 172-183
 
 The ``register_op(...)`` call takes a lowering function, and a number of parameters which specify exactly the operation which should be lowered with the provided lowering function.
 In this case, the arguments we pass specify that this lowering function is for lowering a ``Cast`` from ``float`` to ``myfloat`` for target ``"llvm"``.
@@ -327,7 +340,7 @@ which does just this: given a dictionary, it replaces the given operation with a
 It additionally removes usages of the custom datatype by storing the custom datatype in an opaque ``uint`` of the appropriate width; in our case, a ``uint32_t``.
 For more information, see `the source code <https://github.com/apache/tvm/blob/main/python/tvm/target/datatype.py>`_.
 
-.. GENERATED FROM PYTHON SOURCE LINES 177-187
+.. GENERATED FROM PYTHON SOURCE LINES 183-193
 
 .. code-block:: default
 
@@ -354,7 +367,7 @@ For more information, see `the source code <https://github.com/apache/tvm/blob/m
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 188-194
+.. GENERATED FROM PYTHON SOURCE LINES 194-200
 
 This new error tells us that the ``Add`` lowering function is not found, which is good news, as it's no longer complaining about the ``Cast``!
 We know what to do from here: we just need to register the lowering functions for the other operations in our program.
@@ -363,7 +376,7 @@ Note that for ``Add``, ``create_lower_func`` takes in a dict where the key is an
 For ``Cast`` operations, we require a 2-tuple to specify the ``src_bit_length`` and the ``dest_bit_length``,
 while for all other operations, the bit length is the same between the operands so we only require one integer to specify ``bit_length``.
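 A sketch for ``Add``, keyed on a single bit length; ``MyFloat32Add`` again
 stands in for a function exported by your library:

 .. code-block:: python

     tvm.target.datatype.register_op(
         tvm.target.datatype.create_lower_func({32: "MyFloat32Add"}),
         "Add",
         "llvm",
         "myfloat",
     )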
 
-.. GENERATED FROM PYTHON SOURCE LINES 194-220
+.. GENERATED FROM PYTHON SOURCE LINES 200-226
 
 .. code-block:: default
 
@@ -412,7 +425,7 @@ while for all other operations, the bit length is the same between the operands
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 221-230
+.. GENERATED FROM PYTHON SOURCE LINES 227-236
 
 Running Models With Custom Datatypes
 ------------------------------------
@@ -424,7 +437,7 @@ In this alpha state of the Bring Your Own Datatypes framework, we have not imple
 
 First let us define two helper functions to get the mobilenet model and a cat image.
 
-.. GENERATED FROM PYTHON SOURCE LINES 230-257
+.. GENERATED FROM PYTHON SOURCE LINES 236-263
 
 .. code-block:: default
 
@@ -463,16 +476,16 @@ First let us define two helper functions to get the mobilenet model and a cat im
 
  .. code-block:: none
 
-    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip84b0f19d-62cd-41c0-806f-32091aecde94 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
+    Downloading /workspace/.mxnet/models/mobilenet0.25-9f83e440.zip25d0ef2f-99f9-45bf-9c19-c8a7631be593 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/mobilenet0.25-9f83e440.zip...
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 258-259
+.. GENERATED FROM PYTHON SOURCE LINES 264-265
 
 It's easy to execute MobileNet with native TVM:
 
-.. GENERATED FROM PYTHON SOURCE LINES 259-266
+.. GENERATED FROM PYTHON SOURCE LINES 265-272
 
 .. code-block:: default
 
@@ -499,11 +512,11 @@ It's easy to execute MobileNet with native TVM:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 267-268
+.. GENERATED FROM PYTHON SOURCE LINES 273-274
 
 Now, we would like to change the model to use myfloat internally. To do so, we need to convert the network; we first define a function which will help us convert tensors:
 
-.. GENERATED FROM PYTHON SOURCE LINES 268-278
+.. GENERATED FROM PYTHON SOURCE LINES 274-284
 
 .. code-block:: default
 
@@ -524,11 +537,11 @@ Now, we would like to change the model to use myfloat internally. To do so, we n
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 279-280
+.. GENERATED FROM PYTHON SOURCE LINES 285-286
 
 Now, to actually convert the entire network, we have written `a pass in Relay <https://github.com/gussmith23/tvm/blob/ea174c01c54a2529e19ca71e125f5884e728da6e/python/tvm/relay/frontend/change_datatype.py#L21>`_ which simply converts all nodes within the model to use the new datatype.
 
-.. GENERATED FROM PYTHON SOURCE LINES 280-315
+.. GENERATED FROM PYTHON SOURCE LINES 286-321
 
 .. code-block:: default
 
@@ -582,14 +595,14 @@ Now, to actually convert the entire network, we have written `a pass in Relay <h
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 316-320
+.. GENERATED FROM PYTHON SOURCE LINES 322-326
 
 When we attempt to run the model, we get a familiar error telling us that more functions need to be registered for myfloat.
 
 Because this is a neural network, many more operations are required.
 Here, we register all the needed functions:
 
-.. GENERATED FROM PYTHON SOURCE LINES 320-388
+.. GENERATED FROM PYTHON SOURCE LINES 326-394
 
 .. code-block:: default
 
@@ -668,7 +681,7 @@ Here, we register all the needed functions:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 389-398
+.. GENERATED FROM PYTHON SOURCE LINES 395-404
 
 Note we are making use of two new functions: ``register_min_func`` and ``create_min_lower_func``.
 
@@ -680,7 +693,7 @@ where the minimum representable custom datatype value is implemented using calls
 
 Now we can finally run the model:
 
-.. GENERATED FROM PYTHON SOURCE LINES 398-409
+.. GENERATED FROM PYTHON SOURCE LINES 404-415
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt b/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt
index c0897263a..cb7ecc5c9 100644
--- a/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/low_level_custom_pass.rst.txt
@@ -42,10 +42,11 @@ Before reading this tutorial, we assume readers have already known these topics
 - How a Schedule is lowered to either an IRModule class or an LLVM module. Otherwise,
   take a look at ``python/tvm/build_module.py`` to get some basics.
 
-.. GENERATED FROM PYTHON SOURCE LINES 43-47
+.. GENERATED FROM PYTHON SOURCE LINES 43-48
 
 .. code-block:: default
 
+
     import tvm
     from tvm import te
     import numpy as np
@@ -57,13 +58,13 @@ Before reading this tutorial, we assume readers have already known these topics
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-51
+.. GENERATED FROM PYTHON SOURCE LINES 54-57
 
 We first write a very simple vector add and build it with the default schedule. Then, we use
 our customized lowering pass to manipulate the IR directly instead of using schedule primitives.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 51-61
+.. GENERATED FROM PYTHON SOURCE LINES 57-67
 
 .. code-block:: default
 
@@ -102,7 +103,7 @@ our customized lowering pass to manipulate the IR directly instead of using sche
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 62-67
+.. GENERATED FROM PYTHON SOURCE LINES 68-73
 
 Writing a Pass
 --------------
@@ -110,7 +111,7 @@ Essentially, an "IR transformation pass" is a function which maps a statement to
 Thus, we define this vectorize function and implement it step by step.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 69-84
+.. GENERATED FROM PYTHON SOURCE LINES 75-90
 
 TVM already provides two classes for users to both analyze and transform IR.
 
@@ -128,7 +129,7 @@ return value of ``func`` will be ignored.
     refreshed every recursion but the array values will be preserved.
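 For instance, a small sketch that uses ``tvm.tir.stmt_functor.post_order_visit``
 to collect every ``For`` node, assuming ``stmt`` is the body of a lowered
 PrimFunc:

 .. code-block:: python

     import tvm

     def collect_loops(stmt):
         """Return all For nodes found in stmt, visited in post order."""
         loops = []

         def visit(op):
             if isinstance(op, tvm.tir.For):
                 loops.append(op)

         tvm.tir.stmt_functor.post_order_visit(stmt, visit)
         return loops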
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-96
+.. GENERATED FROM PYTHON SOURCE LINES 90-102
 
 .. code-block:: default
 
@@ -151,7 +152,7 @@ return value of ``func`` will be ignored.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 97-110
+.. GENERATED FROM PYTHON SOURCE LINES 103-116
 
 IR Transformation
 ~~~~~~~~~~~~~~~~~
@@ -167,7 +168,7 @@ this value.
     function will be skipped.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-139
+.. GENERATED FROM PYTHON SOURCE LINES 116-145
 
 .. code-block:: default
 
@@ -207,7 +208,7 @@ this value.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 140-159
+.. GENERATED FROM PYTHON SOURCE LINES 146-165
 
 Glue to Lowering
 ----------------
@@ -229,7 +230,7 @@ called after each phase is done.
 Thus, a good place to put this transformation pass is just after Phase 1.
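 A sketch of the glue, assuming a ``vectorize`` pass function like the one
 developed in this tutorial; the ``(1, ...)`` tuple pins the pass to run after
 Phase 1:

 .. code-block:: python

     with tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, vectorize)]}):
         mod = tvm.build(s, [a, b, c], "llvm")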
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-163
+.. GENERATED FROM PYTHON SOURCE LINES 165-169
 
 .. code-block:: default
 
@@ -263,7 +264,7 @@ Thus, a good place to put this transformation pass is just after Phase 1.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 164-172
+.. GENERATED FROM PYTHON SOURCE LINES 170-178
 
 Quick View
 ----------
diff --git a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
index 22f6bb1b2..2ffe3c41d 100644
--- a/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/sg_execution_times.rst.txt
@@ -5,14 +5,14 @@
 
 Computation times
 =================
-**00:39.822** total execution time for **how_to_extend_tvm** files:
+**00:39.122** total execution time for **how_to_extend_tvm** files:
 
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:36.720 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``) | 00:36.031 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.184 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_instrument.py` (``use_pass_instrument.py``)           | 00:02.181 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:00.912 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_use_pass_infra.py` (``use_pass_infra.py``)                     | 00:00.903 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)       | 00:00.006 | 0.0 MB |
+| :ref:`sphx_glr_how_to_extend_tvm_low_level_custom_pass.py` (``low_level_custom_pass.py``)       | 00:00.008 | 0.0 MB |
 +-------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt
index 9e167e49a..c422ed29c 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_infra.rst.txt
@@ -41,11 +41,12 @@ This tutorial mainly demonstrates how developers can use the pass infra to perfo
 a certain optimization and create an optimization pipeline for a Relay program.
 The same approach can be used for tir as well.
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-48
+.. GENERATED FROM PYTHON SOURCE LINES 42-49
 
 .. code-block:: default
 
 
+
     import numpy as np
     import tvm
     from tvm import te
@@ -58,7 +59,7 @@ The same approach can be used for tir as well.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 49-54
+.. GENERATED FROM PYTHON SOURCE LINES 55-60
 
 Create An Example Relay Program
 -------------------------------
@@ -66,7 +67,7 @@ First of all, we create a simple Relay program for the tutorial. This program
 will be used by the various optimization examples in this tutorial.
 Similarly, users can write a tir primitive function and apply the tir passes.
 
-.. GENERATED FROM PYTHON SOURCE LINES 54-72
+.. GENERATED FROM PYTHON SOURCE LINES 60-78
 
 .. code-block:: default
 
@@ -95,7 +96,7 @@ Similarly, users can write a tir primitive function and apply the tir passes.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 73-83
+.. GENERATED FROM PYTHON SOURCE LINES 79-89
 
 Optimize the Program
 --------------------
@@ -108,7 +109,7 @@ examples for each of them.
 Manually Apply Optimization Passes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
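 Conceptually, applying a pass by hand is just two steps: instantiate the pass
 object, then call it on the module. A sketch, assuming the ``mod`` created
 above:

 .. code-block:: python

     from tvm import relay

     fold_const = relay.transform.FoldConstant()  # a pass object
     mod = fold_const(mod)                        # apply it to the module
     print(mod)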
 
-.. GENERATED FROM PYTHON SOURCE LINES 83-100
+.. GENERATED FROM PYTHON SOURCE LINES 89-106
 
 .. code-block:: default
 
@@ -152,12 +153,12 @@ Manually Apply Optimization Passes
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-103
+.. GENERATED FROM PYTHON SOURCE LINES 107-109
 
 More optimizations can be applied in a similar manner. For instance, we can
 eliminate the common expressions used by `z` and `z1`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 103-106
+.. GENERATED FROM PYTHON SOURCE LINES 109-112
 
 .. code-block:: default
 
@@ -184,13 +185,13 @@ eliminate the common expressions that used by `z` and `z1`.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 107-110
+.. GENERATED FROM PYTHON SOURCE LINES 113-116
 
 Some optimizations, such as fusion, are parametric as well. For example,
 opt level 0 will not allow operators to be fused together. Users can pass
 `fuse_opt_level` to enable this.
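 A sketch of passing the fusion level explicitly:

 .. code-block:: python

     mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
     print(mod)  # operators are now grouped into fused functions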
 
-.. GENERATED FROM PYTHON SOURCE LINES 110-116
+.. GENERATED FROM PYTHON SOURCE LINES 116-122
 
 .. code-block:: default
 
@@ -232,7 +233,7 @@ opt level 0 will not allow operators to be fused together. Users can pass the
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-136
+.. GENERATED FROM PYTHON SOURCE LINES 123-142
 
 Use Sequential to Apply a Sequence of Passes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -254,7 +255,7 @@ For example, `torch.nn.sequential` is used to contain a sequence of PyTorch
 layers. Similarly, :py:class:`tvm.transform.Sequential` in our pass infra works on optimization
 passes.
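 A sketch of bundling several passes, assuming the ``mod`` from above; note that
 passes gated behind a higher optimization level need an enclosing
 ``PassContext``:

 .. code-block:: python

     seq = tvm.transform.Sequential(
         [
             relay.transform.FoldConstant(),
             relay.transform.EliminateCommonSubexpr(),
             relay.transform.FuseOps(fuse_opt_level=2),
         ]
     )
     with tvm.transform.PassContext(opt_level=3):
         mod = seq(mod)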
 
-.. GENERATED FROM PYTHON SOURCE LINES 136-151
+.. GENERATED FROM PYTHON SOURCE LINES 142-157
 
 .. code-block:: default
 
@@ -299,7 +300,7 @@ pass.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 152-159
+.. GENERATED FROM PYTHON SOURCE LINES 158-165
 
 From the transformed Relay program, we can see that there are still two
 identical addition operations. This is because ``EliminateCommonSubexpr`` was not
 actually performed: only passes with an
 optimization level less than or equal to 2 will be executed by default under
 :py:class:`tvm.transform.Sequential`. The pass infra, however, provides a configuration interface
 for users to customize the optimization level that they want to execute.
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-164
+.. GENERATED FROM PYTHON SOURCE LINES 165-170
 
 .. code-block:: default
 
@@ -343,7 +344,7 @@ for users to customize the optimization level that they want to execute.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 165-172
+.. GENERATED FROM PYTHON SOURCE LINES 171-178
 
 Now we can see that only one of the two identical additions is kept.
 
 Users can also selectively disable passes, similar to the ``-fno-xxx`` options of
 general purpose compilers, such as Clang and GCC. For example, we can disable
 EliminateCommonSubexpr as follows. The printed module will again show two
 identical addition operations.
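 A sketch of the disabling interface:

 .. code-block:: python

     with tvm.transform.PassContext(opt_level=3, disabled_pass=["EliminateCommonSubexpr"]):
         mod = seq(mod)  # the sequence now runs without common-subexpression elimination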
 
-.. GENERATED FROM PYTHON SOURCE LINES 172-177
+.. GENERATED FROM PYTHON SOURCE LINES 178-183
 
 .. code-block:: default
 
@@ -388,7 +389,7 @@ identical addition operations.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 178-188
+.. GENERATED FROM PYTHON SOURCE LINES 184-194
 
 Implement a Pass Using Python Decorator
 ------------------------------------------
 The example pass below replaces every constant in a function
 with a multiple of `c`. Later on, each function in a given module will be
 visited and each constant in the function will be replaced when we invoke the
 customized pass.
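 A sketch of the decorator pattern; the constant multiplier used here is purely
 illustrative:

 .. code-block:: python

     from tvm import relay

     @relay.transform.function_pass(opt_level=1)
     class MultiplyConstants:
         """Rewrite every constant c in a function to c * 2."""

         def transform_function(self, func, mod, ctx):
             class Rewriter(relay.ExprMutator):
                 def visit_constant(self, const):
                     return relay.multiply(const, relay.const(2.0, const.data.dtype))

             return Rewriter().visit(func)

     # Instantiating the decorated class yields a function-level pass.
     mod = MultiplyConstants()(mod)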
 
-.. GENERATED FROM PYTHON SOURCE LINES 188-215
+.. GENERATED FROM PYTHON SOURCE LINES 194-221
 
 .. code-block:: default
 
@@ -457,7 +458,7 @@ customized pass.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 216-222
+.. GENERATED FROM PYTHON SOURCE LINES 222-228
 
 Debug a Pass
 ------------
@@ -466,7 +467,7 @@ after a certain pass is done through a special pass (``PrintIR``) to dump the IR
 whole module. A slightly modified version of the sequential pass example
 could be like the following to enable IR dumping for ``FoldConstant`` optimization.
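 A sketch of the modified sequence:

 .. code-block:: python

     seq = tvm.transform.Sequential(
         [
             relay.transform.FoldConstant(),
             tvm.transform.PrintIR(),  # dump the module right after FoldConstant
             relay.transform.EliminateCommonSubexpr(),
         ]
     )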
 
-.. GENERATED FROM PYTHON SOURCE LINES 222-234
+.. GENERATED FROM PYTHON SOURCE LINES 228-240
 
 .. code-block:: default
 
@@ -489,7 +490,7 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 235-246
+.. GENERATED FROM PYTHON SOURCE LINES 241-252
 
 By inserting the ``PrintIR`` pass after ``FoldConstant``, the pass infra will
 dump out the module IR when ``FoldConstant`` is done. Users can plug in this
 pass after any pass they want to debug; refer to the :ref:`pass-infra` documentation
 for more details.
 Here we use the :py:func:`tvm.instrument.pass_instrument` decorator to implement
 a PassInstrument class printing IR before the execution of each pass:
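 A sketch of such an instrument and how it is installed:

 .. code-block:: python

     @tvm.instrument.pass_instrument
     class PrintPassName:
         """Print the name of each pass right before it runs."""

         def run_before_pass(self, mod, info):
             print("Running pass:", info.name)

     with tvm.transform.PassContext(instruments=[PrintPassName()]):
         mod = tvm.transform.Sequential([relay.transform.FoldConstant()])(mod)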
 
-.. GENERATED FROM PYTHON SOURCE LINES 246-265
+.. GENERATED FROM PYTHON SOURCE LINES 252-271
 
 .. code-block:: default
 
@@ -667,7 +668,7 @@ a PassInsturment class printing IR before execution of each passes:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 266-274
+.. GENERATED FROM PYTHON SOURCE LINES 272-280
 
 Summary
 -------
diff --git a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
index 6135becf3..096bf9e54 100644
--- a/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
+++ b/docs/_sources/how_to/extend_tvm/use_pass_instrument.rst.txt
@@ -35,10 +35,11 @@ but an extension mechanism is available via the :py:func:`tvm.instrument.pass_in
 This tutorial demonstrates how developers can use ``PassContext`` to instrument
 passes. Please also refer to the :ref:`pass-infra`.
 
-.. GENERATED FROM PYTHON SOURCE LINES 36-47
+.. GENERATED FROM PYTHON SOURCE LINES 36-48
 
 .. code-block:: default
 
+
     import tvm
     import tvm.relay as relay
     from tvm.relay.testing import resnet
@@ -57,13 +58,13 @@ passes. Please also refer to the :ref:`pass-infra`.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-51
+.. GENERATED FROM PYTHON SOURCE LINES 54-57
 
 Create An Example Relay Program
 -------------------------------
 We use pre-defined resnet-18 network in Relay.
 
-.. GENERATED FROM PYTHON SOURCE LINES 51-60
+.. GENERATED FROM PYTHON SOURCE LINES 57-66
 
 .. code-block:: default
 
@@ -184,7 +185,7 @@ We use pre-defined resnet-18 network in Relay.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 61-66
+.. GENERATED FROM PYTHON SOURCE LINES 67-72
 
 Create PassContext With Instruments
 -----------------------------------
@@ -192,7 +193,7 @@ To run all passes with an instrument, pass it via the ``instruments`` argument t
 the ``PassContext`` constructor. A built-in ``PassTimingInstrument`` is used to
 profile the execution time of each pass.
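 A sketch of the timing flow; note that ``render()`` must be called while the
 ``PassContext`` is still active, since profiles are cleared on exit:

 .. code-block:: python

     from tvm.ir.instrument import PassTimingInstrument

     timing_inst = PassTimingInstrument()
     with tvm.transform.PassContext(opt_level=3, instruments=[timing_inst]):
         relay.build(mod, target="llvm", params=params)
         profiles = timing_inst.render()
     print(profiles)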
 
-.. GENERATED FROM PYTHON SOURCE LINES 66-76
+.. GENERATED FROM PYTHON SOURCE LINES 72-82
 
 .. code-block:: default
 
@@ -215,16 +216,16 @@ profile the execution time of each passes.
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 6500us [6500us] (45.24%; 45.24%)
-    FoldScaleAxis: 7869us [5us] (54.76%; 54.76%)
-            FoldConstant: 7864us [1647us] (54.73%; 99.93%)
-                    InferType: 6217us [6217us] (43.27%; 79.06%)
+    InferType: 6410us [6410us] (45.54%; 45.54%)
+    FoldScaleAxis: 7667us [5us] (54.46%; 54.46%)
+            FoldConstant: 7662us [1559us] (54.43%; 99.93%)
+                    InferType: 6103us [6103us] (43.35%; 79.65%)
 
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 77-85
+.. GENERATED FROM PYTHON SOURCE LINES 83-91
 
 Use Current PassContext With Instruments
 ----------------------------------------
 It is possible to register instruments into the current ``PassContext`` using
 the ``override_instruments`` method, which calls ``exit_pass_ctx`` of the old instruments
 if any instrument already exists. Then it switches to the new instruments
 and calls the ``enter_pass_ctx`` method of the new instruments.
 Refer to following sections and :py:func:`tvm.instrument.pass_instrument` for these methods.
 
-.. GENERATED FROM PYTHON SOURCE LINES 85-94
+.. GENERATED FROM PYTHON SOURCE LINES 91-100
 
 .. code-block:: default
 
@@ -257,23 +258,23 @@ Refer to following sections and :py:func:`tvm.instrument.pass_instrument` for th
  .. code-block:: none
 
     Printing results of timing profile...
-    InferType: 6068us [6068us] (44.31%; 44.31%)
-    FoldScaleAxis: 7625us [5us] (55.69%; 55.69%)
-            FoldConstant: 7620us [1583us] (55.65%; 99.94%)
-                    InferType: 6038us [6038us] (44.09%; 79.23%)
+    InferType: 6109us [6109us] (44.64%; 44.64%)
+    FoldScaleAxis: 7577us [4us] (55.36%; 55.36%)
+            FoldConstant: 7572us [1559us] (55.33%; 99.94%)
+                    InferType: 6014us [6014us] (43.94%; 79.41%)
 
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 95-99
+.. GENERATED FROM PYTHON SOURCE LINES 101-105
 
 Register an empty list to clear existing instruments.
 
 Note that ``exit_pass_ctx`` of ``PassTimingInstrument`` is called.
 Profiles are cleared so nothing is printed.
 
-.. GENERATED FROM PYTHON SOURCE LINES 99-105
+.. GENERATED FROM PYTHON SOURCE LINES 105-111
 
 .. code-block:: default
 
@@ -290,7 +291,7 @@ Profiles are cleared so nothing is printed.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 106-114
+.. GENERATED FROM PYTHON SOURCE LINES 112-120
 
 Create Customized Instrument Class
 ----------------------------------
@@ -301,7 +302,7 @@ Let's create an instrument class which calculates the change in number of
 occurrences of each operator caused by each pass. We can look at ``op.name`` to
 find the name of each operator. And we do this before and after passes to calculate the difference.
 
-.. GENERATED FROM PYTHON SOURCE LINES 114-190
+.. GENERATED FROM PYTHON SOURCE LINES 120-196
 
 .. code-block:: default
 
@@ -388,7 +389,7 @@ find the name of each operator. And we do this before and after passes to calcul
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 191-199
+.. GENERATED FROM PYTHON SOURCE LINES 197-205
 
 Apply Passes and Multiple Instrument Classes
 --------------------------------------------
 So for instrument classes like ``PassTimingInstrument``, it is inevitable that
 the execution time of other instrument classes is counted toward the final
 profile result.
 
-.. GENERATED FROM PYTHON SOURCE LINES 199-220
+.. GENERATED FROM PYTHON SOURCE LINES 205-226
 
 .. code-block:: default
 
@@ -438,11 +439,11 @@ profile result.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 221-222
+.. GENERATED FROM PYTHON SOURCE LINES 227-228
 
 We can see how many CallNode increase/decrease per op type.
 
-.. GENERATED FROM PYTHON SOURCE LINES 222-228
+.. GENERATED FROM PYTHON SOURCE LINES 228-234
 
 .. code-block:: default
 
@@ -469,7 +470,7 @@ We can see how many CallNode increase/decrease per op type.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 229-234
+.. GENERATED FROM PYTHON SOURCE LINES 235-240
 
 Exception Handling
 ------------------
@@ -477,7 +478,7 @@ Let's see what happens if an exception occurs in a method of a ``PassInstrument`
 
 Define ``PassInstrument`` classes which raise exceptions in enter/exit ``PassContext``:
 
-.. GENERATED FROM PYTHON SOURCE LINES 234-274
+.. GENERATED FROM PYTHON SOURCE LINES 240-280
 
 .. code-block:: default
 
@@ -528,7 +529,7 @@ Define ``PassInstrument`` classes which raise exceptions in enter/exit ``PassCon
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 275-280
+.. GENERATED FROM PYTHON SOURCE LINES 281-286
 
 If an exception occurs in ``enter_pass_ctx``, ``PassContext`` will disable the pass
 instrumentation, and it will run the ``exit_pass_ctx`` of each ``PassInstrument``
 that successfully finished ``enter_pass_ctx``.
 
 In the following example, we can see that the ``exit_pass_ctx`` of `PassFine_0` is executed after the exception.
 
-.. GENERATED FROM PYTHON SOURCE LINES 280-293
+.. GENERATED FROM PYTHON SOURCE LINES 286-299
 
 .. code-block:: default
 
@@ -569,12 +570,12 @@ In following example, we can see ``exit_pass_ctx`` of `PassFine_0` is executed a
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 294-296
+.. GENERATED FROM PYTHON SOURCE LINES 300-302
 
 Exceptions in ``PassInstrument`` instances cause all instruments of the current ``PassContext``
 to be cleared, so nothing is printed when ``override_instruments`` is called.
 
-.. GENERATED FROM PYTHON SOURCE LINES 296-298
+.. GENERATED FROM PYTHON SOURCE LINES 302-304
 
 .. code-block:: default
 
@@ -587,13 +588,13 @@ to be cleared, so nothing is printed when ``override_instruments`` is called.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 299-302
+.. GENERATED FROM PYTHON SOURCE LINES 305-308
 
 If an exception occurs in ``exit_pass_ctx``, then the pass instrument is disabled.
 The exception is then propagated. That means ``PassInstrument`` instances registered
 after the one throwing the exception do not execute ``exit_pass_ctx``.
 
-.. GENERATED FROM PYTHON SOURCE LINES 302-316
+.. GENERATED FROM PYTHON SOURCE LINES 308-322
 
 .. code-block:: default
 
@@ -638,7 +639,7 @@ after the one throwing the exception do not execute ``exit_pass_ctx``.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 317-322
+.. GENERATED FROM PYTHON SOURCE LINES 323-328
 
 Exceptions raised in ``should_run``, ``run_before_pass``, ``run_after_pass``
 are not handled explicitly -- we rely on the context manager (the ``with`` syntax)
@@ -646,7 +647,7 @@ to exit ``PassContext`` safely.
 
 We use ``run_before_pass`` as an example:
 
-.. GENERATED FROM PYTHON SOURCE LINES 322-343
+.. GENERATED FROM PYTHON SOURCE LINES 328-349
 
 .. code-block:: default
 
@@ -695,13 +696,13 @@ We use ``run_before_pass`` as an example:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 344-347
+.. GENERATED FROM PYTHON SOURCE LINES 350-353
 
 Also note that pass instrumentation is not disabled. So if we call
 ``override_instruments``, the ``exit_pass_ctx`` of the previously registered ``PassInstrument``
 is called.
 
-.. GENERATED FROM PYTHON SOURCE LINES 347-349
+.. GENERATED FROM PYTHON SOURCE LINES 353-355
 
 .. code-block:: default
 
@@ -722,12 +723,12 @@ is called.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 350-352
+.. GENERATED FROM PYTHON SOURCE LINES 356-358
 
 If we don't wrap pass execution with the ``with`` syntax, ``exit_pass_ctx`` is not
 called. Let's try this with the current ``PassContext``:
 
-.. GENERATED FROM PYTHON SOURCE LINES 352-361
+.. GENERATED FROM PYTHON SOURCE LINES 358-367
 
 .. code-block:: default
 
@@ -755,12 +756,12 @@ called. Let try this with current ``PassContext``:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 362-364
+.. GENERATED FROM PYTHON SOURCE LINES 368-370
 
 Then call passes. ``exit_pass_ctx`` is not executed after the exception,
 as expected.
 
-.. GENERATED FROM PYTHON SOURCE LINES 364-370
+.. GENERATED FROM PYTHON SOURCE LINES 370-376
 
 .. code-block:: default
 
@@ -788,11 +789,11 @@ as expectation.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 371-372
+.. GENERATED FROM PYTHON SOURCE LINES 377-378
 
 Clear instruments.
 
-.. GENERATED FROM PYTHON SOURCE LINES 372-373
+.. GENERATED FROM PYTHON SOURCE LINES 378-379
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
index dc5d3cca0..c93f9faef 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_cuda.rst.txt
@@ -31,7 +31,20 @@ example, we use a different layout to store the data in order to achieve better
 data locality. The buffer layout is HWCN, which stands for height, width,
 channel, batch.
 
-.. GENERATED FROM PYTHON SOURCE LINES 34-42
+.. GENERATED FROM PYTHON SOURCE LINES 32-34
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 40-48
 
 Preparation and Algorithm
 -------------------------
@@ -42,7 +55,7 @@ of size 3 x 3.  We use stride size 1 and padding size 1 for the
 convolution. The following code defines the convolution algorithm in TVM.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-84
+.. GENERATED FROM PYTHON SOURCE LINES 48-90
 
 .. code-block:: default
 
@@ -95,7 +108,7 @@ convolution. The following code defines the convolution algorithm in TVM.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 85-106
+.. GENERATED FROM PYTHON SOURCE LINES 91-112
 
 Memory Hierarchy
 ----------------
@@ -119,7 +132,7 @@ WL. BL is a local cache of output B, which is also stored in the thread local
 registers.
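 The scoping calls look like the following sketch, assuming the ``Apad``, ``W``,
 and ``B`` tensors from the algorithm above:

 .. code-block:: python

     s = te.create_schedule(B.op)
     s[Apad].compute_inline()                # compute the padding inline
     AA = s.cache_read(Apad, "shared", [B])  # stage inputs in shared memory
     WW = s.cache_read(W, "shared", [B])
     AL = s.cache_read(AA, "local", [B])     # then in per-thread registers
     WL = s.cache_read(WW, "local", [B])
     BL = s.cache_write(B, "local")          # accumulate the output locally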
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 106-116
+.. GENERATED FROM PYTHON SOURCE LINES 112-122
 
 .. code-block:: default
 
@@ -140,7 +153,7 @@ registers.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-133
+.. GENERATED FROM PYTHON SOURCE LINES 123-139
 
 Blocking
 --------
@@ -159,7 +172,7 @@ shared memory.
      :width: 317px
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 133-161
+.. GENERATED FROM PYTHON SOURCE LINES 139-167
 
 .. code-block:: default
 
@@ -198,7 +211,7 @@ shared memory.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 162-175
+.. GENERATED FROM PYTHON SOURCE LINES 168-181
 
 Virtual Thread Split
 --------------------
 each thread computes 4 strided grids, where the size of each grid is 4 x 4.
      :width: 268px
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 175-187
+.. GENERATED FROM PYTHON SOURCE LINES 181-193
 
 .. code-block:: default
 
@@ -237,7 +250,7 @@ each thread computes 4 strided grids, where size of each grid is 4 x 4.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 188-196
+.. GENERATED FROM PYTHON SOURCE LINES 194-202
 
 Cooperative Fetching
 --------------------
@@ -248,7 +261,7 @@ transfer per thread, the following code lets threads in the same thread block
 cooperatively fetch dependent data from global memory.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 196-232
+.. GENERATED FROM PYTHON SOURCE LINES 202-238
 
 .. code-block:: default
 
@@ -295,7 +308,7 @@ coopertively fetch dependent data from global memory.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 233-239
+.. GENERATED FROM PYTHON SOURCE LINES 239-245
 
 Generate CUDA Kernel
 --------------------
@@ -304,7 +317,7 @@ Finally we use TVM to generate and compile the CUDA kernel, and evaluate the
 latency of convolution.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 239-250
+.. GENERATED FROM PYTHON SOURCE LINES 245-256
 
 .. code-block:: default
 
@@ -327,7 +340,7 @@ latency of convolution.
 
  .. code-block:: none
 
-    Convolution: 54.247572 ms
+    Convolution: 33.662874 ms
 
 
 
diff --git a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
index 44a0c2ede..9f802c55b 100644
--- a/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_conv_tensorcore.rst.txt
@@ -28,7 +28,20 @@ In this tutorial, we will demonstrate how to write a high performance convolutio
 schedule using TensorCores in TVM. In this example, we assume the input to
 convolution has a large batch. We strongly recommend covering the :ref:`opt-conv-gpu` tutorial first.
 
-.. GENERATED FROM PYTHON SOURCE LINES 31-45
+.. GENERATED FROM PYTHON SOURCE LINES 29-31
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 37-51
 
 TensorCore Introduction
 -----------------------
@@ -45,7 +58,7 @@ with primitive :code:`wmma::load_matrix_sync`, explicitly. The NVCC compiler tra
 that primitive into multiple memory load instructions. At run time, every thread loads
 16 elements from matrix A and 16 elements from B.
 
-.. GENERATED FROM PYTHON SOURCE LINES 47-53
+.. GENERATED FROM PYTHON SOURCE LINES 53-59
 
 Preparation and Algorithm
 -------------------------
@@ -54,7 +67,7 @@ The batch size is 256. Convolution filters contain 512 filters of size 3 x 3.
 We use stride size 1 and padding size 1 for the convolution. In the example, we use
 NHWCnc memory layout. The following code defines the convolution algorithm in TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 53-145
+.. GENERATED FROM PYTHON SOURCE LINES 59-151
 
 .. code-block:: default
 
@@ -157,7 +170,7 @@ NHWCnc memory layout.The following code defines the convolution algorithm in TVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 146-152
+.. GENERATED FROM PYTHON SOURCE LINES 152-158
 
 Memory Scope
 ------------
@@ -166,7 +179,7 @@ To support TensorCores, we add another three special memory scope: :code:`wmma.m
 :code:`wmma.matrix_b` and :code:`wmma.accumulator`. On hardware, all fragment scopes
 are stored at the on-chip register level, the same place as local memory.
 
-.. GENERATED FROM PYTHON SOURCE LINES 152-160
+.. GENERATED FROM PYTHON SOURCE LINES 158-166
 
 .. code-block:: default
 
@@ -185,7 +198,7 @@ stores at the on-chip registers level, the same place with local memory.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 161-170
+.. GENERATED FROM PYTHON SOURCE LINES 167-176
 
 Define Tensor Intrinsic
 -----------------------
@@ -197,7 +210,7 @@ There are four basic operation in TensorCore: :code:`fill_fragment`, :code:`load
 :code:`mma_sync` and :code:`store_matrix`. Since :code:`fill_fragment` and :code:`mma_sync`
 are both used in matrix multiplication, we can just write the following three intrinsics.
 
-.. GENERATED FROM PYTHON SOURCE LINES 170-291
+.. GENERATED FROM PYTHON SOURCE LINES 176-297
 
 .. code-block:: default
 
@@ -329,7 +342,7 @@ are both used in matrix multiplication, so we can just write following three int
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 292-313
+.. GENERATED FROM PYTHON SOURCE LINES 298-319
 
 Scheduling the Computation
 --------------------------
@@ -353,7 +366,7 @@ one time.
   TensorCore intrinsics directly or indirectly. Also note that it is not the only solution.
   The only thing we should do is to make sure all threads in a warp can call TensorCore at the same time.
 
-.. GENERATED FROM PYTHON SOURCE LINES 313-376
+.. GENERATED FROM PYTHON SOURCE LINES 319-382
 
 .. code-block:: default
 
@@ -528,14 +541,14 @@ one time.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 377-381
+.. GENERATED FROM PYTHON SOURCE LINES 383-387
 
 Lowering Computation to Intrinsics
 ----------------------------------
 The last phase is to lower the computation loops down to TensorCore hardware intrinsics
 by mapping the 2D convolution to tensor intrinsics
 
-.. GENERATED FROM PYTHON SOURCE LINES 381-388
+.. GENERATED FROM PYTHON SOURCE LINES 387-394
 
 .. code-block:: default
 
@@ -625,7 +638,7 @@ by mapping the 2D convolution to tensor intrinsics
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 389-394
+.. GENERATED FROM PYTHON SOURCE LINES 395-400
 
 Generate CUDA Kernel
 --------------------
@@ -633,7 +646,7 @@ Finally we use TVM to generate and compile the CUDA kernel, and evaluate the lat
 Since TensorCores are only supported on NVIDIA GPUs with Compute Capability 7.0 or higher, it may not
 be able to run on our build server.
 
-.. GENERATED FROM PYTHON SOURCE LINES 394-407
+.. GENERATED FROM PYTHON SOURCE LINES 400-413
 
 .. code-block:: default
 
@@ -658,12 +671,12 @@ be able to run on our build server
 
  .. code-block:: none
 
-    conv2d with tensor core: 13.390722 ms
+    conv2d with tensor core: 9.811266 ms
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 408-412
+.. GENERATED FROM PYTHON SOURCE LINES 414-418
 
 Summary
 -------
diff --git a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
index 285d9ea8d..7e9b4859d 100644
--- a/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/opt_gemm.rst.txt
@@ -49,7 +49,20 @@ abstraction automatically, but some of them cannot be simply applied due to TVM
 All the experiment results mentioned below were obtained on a 2015 15" MacBook equipped with an
 Intel i7-4770HQ CPU. The cache line size is 64 bytes for all the x86 CPUs used.
 
-.. GENERATED FROM PYTHON SOURCE LINES 52-57
+.. GENERATED FROM PYTHON SOURCE LINES 50-52
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 58-63
 
 Preparation and Baseline
 ------------------------
@@ -57,7 +70,7 @@ In this tutorial, we will demo how to use TVM to optimize matrix multiplication.
 Before actually demonstrating, we first define these variables.
 Then we write a baseline implementation, the simplest way to write a matrix multiplication in TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 57-118
+.. GENERATED FROM PYTHON SOURCE LINES 63-124
 
 .. code-block:: default
 
@@ -130,18 +143,18 @@ Then we write a baseline implementation, the simplest way to write a matrix mult
 
  .. code-block:: none
 
-    Numpy running time: 0.019613
-    Baseline: 3.402437
+    Numpy running time: 0.018728
+    Baseline: 3.411026
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 119-121
+.. GENERATED FROM PYTHON SOURCE LINES 125-127
 
 In TVM, we can always inspect lower level IR to debug or optimize our schedule.
 Here is the generated IR using our baseline schedule.
 
-.. GENERATED FROM PYTHON SOURCE LINES 121-124
+.. GENERATED FROM PYTHON SOURCE LINES 127-130
 
 .. code-block:: default
 
@@ -180,7 +193,7 @@ Here is the generated IR using our baseline schedule.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-131
+.. GENERATED FROM PYTHON SOURCE LINES 131-137
 
 Blocking
 --------
@@ -189,7 +202,7 @@ block by block. The memory access inside the block is a small neighbourhood whic
 memory locality. In this tutorial, we pick 32 as the blocking factor, so a block will
 fill 32 * 32 * sizeof(float), i.e. 4KB, of the 32KB L1 data cache.
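 A sketch of the blocking schedule, assuming the ``C`` computation from the
 baseline:

 .. code-block:: python

     bn = 32
     s = te.create_schedule(C.op)
     # Tile the two output axes into (outer, inner) pairs of size bn x bn.
     xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
     (k,) = s[C].op.reduce_axis
     ko, ki = s[C].split(k, factor=4)
     # Compute each bn x bn block entirely in the inner loops.
     s[C].reorder(xo, yo, ko, ki, xi, yi)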
 
-.. GENERATED FROM PYTHON SOURCE LINES 131-156
+.. GENERATED FROM PYTHON SOURCE LINES 137-162
 
 .. code-block:: default
 
@@ -226,16 +239,16 @@ fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB
 
  .. code-block:: none
 
-    Opt1: 0.312701
+    Opt1: 0.297716
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-158
+.. GENERATED FROM PYTHON SOURCE LINES 163-164
 
 Here is the generated IR after blocking.
 
-.. GENERATED FROM PYTHON SOURCE LINES 158-161
+.. GENERATED FROM PYTHON SOURCE LINES 164-167
 
 .. code-block:: default
 
@@ -285,7 +298,7 @@ Here is the generated IR after blocking.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 162-170
+.. GENERATED FROM PYTHON SOURCE LINES 168-176
 
 Vectorization
 -------------
@@ -296,7 +309,7 @@ vastly.
 
 In this tutorial, we choose to vectorize the inner-loop row data since it is cache friendly.
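 Building on the blocked schedule above, this is a single call (a sketch):

 .. code-block:: python

     s[C].vectorize(yi)  # yi iterates over contiguous row data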
 
-.. GENERATED FROM PYTHON SOURCE LINES 170-191
+.. GENERATED FROM PYTHON SOURCE LINES 176-197
 
 .. code-block:: default
 
@@ -329,16 +342,16 @@ In this tutorial, we chose to vectorize the inner loop row data since it is cach
 
  .. code-block:: none
 
-    Opt2: 0.340155
+    Opt2: 0.334084
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 192-193
+.. GENERATED FROM PYTHON SOURCE LINES 198-199
 
 Here is the generated IR after vectorization.
 
-.. GENERATED FROM PYTHON SOURCE LINES 193-196
+.. GENERATED FROM PYTHON SOURCE LINES 199-202
 
 .. code-block:: default
 
@@ -384,7 +397,7 @@ Here is the generated IR after vectorization.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 197-203
+.. GENERATED FROM PYTHON SOURCE LINES 203-209
 
 Loop Permutation
 ----------------
@@ -393,7 +406,7 @@ Next we will look at the access pattern of A. In current schedule, A is accessed
 which is not cache friendly. If we change the nested loop order of ki and inner axes mi,
 the access pattern for the A matrix becomes more cache friendly.
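 A sketch of the permuted loop order, hoisting ``ki`` above the inner row axis
 ``xi``:

 .. code-block:: python

     s[C].reorder(xo, yo, ko, xi, ki, yi)
     s[C].vectorize(yi)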
 
-.. GENERATED FROM PYTHON SOURCE LINES 203-223
+.. GENERATED FROM PYTHON SOURCE LINES 209-229
 
 .. code-block:: default
 
@@ -425,16 +438,16 @@ the access pattern for A matrix is more cache friendly.
 
  .. code-block:: none
 
-    Opt3: 0.132955
+    Opt3: 0.118797
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 224-225
+.. GENERATED FROM PYTHON SOURCE LINES 230-231
 
 Here is the generated IR after loop permutation.
 
-.. GENERATED FROM PYTHON SOURCE LINES 225-228
+.. GENERATED FROM PYTHON SOURCE LINES 231-234
 
 .. code-block:: default
 
@@ -480,7 +493,7 @@ Here is the generated IR after loop permutation.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 229-239
+.. GENERATED FROM PYTHON SOURCE LINES 235-245
 
 Array Packing
 -------------
@@ -493,7 +506,7 @@ dimensional memory.
 
 NOTE: This figure is a general illustration of how array packing works.
 
-.. GENERATED FROM PYTHON SOURCE LINES 242-249
+.. GENERATED FROM PYTHON SOURCE LINES 248-255
 
 We can use array packing to address the access pattern for B. Observe the array access pattern of
 B after flattening which is not sequential as we iterate over the K dimension. We can reorder B
@@ -503,7 +516,7 @@ bigN (N/bn) and littleN (bn) --- and the new dimensions [N/bn][K][bn] match the
 from outer to inner loops (no, ko, ki, ni) resulting in a sequential access pattern for B after
 flattening.
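 A sketch of the packed layout and the matching computation, assuming the ``A``,
 ``B``, and ``k`` definitions from the baseline:

 .. code-block:: python

     # Pack B into [N/bn][K][bn] so the innermost dimension is contiguous.
     packedB = te.compute(
         (N // bn, K, bn), lambda x, y, z: B[y, x * bn + z], name="packedB"
     )
     C = te.compute(
         (M, N),
         lambda x, y: te.sum(
             A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k
         ),
         name="C",
     )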
 
-.. GENERATED FROM PYTHON SOURCE LINES 249-284
+.. GENERATED FROM PYTHON SOURCE LINES 255-290
 
 .. code-block:: default
 
@@ -550,16 +563,16 @@ flattening.
 
  .. code-block:: none
 
-    Opt4: 0.111434
+    Opt4: 0.110474
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 285-286
+.. GENERATED FROM PYTHON SOURCE LINES 291-292
 
 Here is the generated IR after array packing.
 
-.. GENERATED FROM PYTHON SOURCE LINES 286-289
+.. GENERATED FROM PYTHON SOURCE LINES 292-295
 
 .. code-block:: default
 
@@ -612,7 +625,7 @@ Here is the generated IR after array packing.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 290-296
+.. GENERATED FROM PYTHON SOURCE LINES 296-302
 
 Write cache for blocks
 ----------------------
@@ -621,7 +634,7 @@ is not sequential. So we can use a sequential cache array to hold the block resu
 write to C when all the block results are ready.
 
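A sketch of the write-cache schedule, assuming the packed-B compute from the
previous sketch (``CC`` names the cache stage):

.. code-block:: python

    s = te.create_schedule(C.op)
    # Accumulate each bn x bn block into a small sequential buffer, then
    # copy the finished block into C in one pass.
    CC = s.cache_write(C, "global")
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    s[CC].compute_at(s[C], no)
    mc, nc = s[CC].op.axis
    (kaxis,) = s[CC].op.reduce_axis
    ko, ki = s[CC].split(kaxis, factor=4)
    s[CC].reorder(ko, mc, ki, nc)
    s[CC].vectorize(nc)
    s[CC].unroll(ki)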
 
-.. GENERATED FROM PYTHON SOURCE LINES 296-335
+.. GENERATED FROM PYTHON SOURCE LINES 302-341
 
 .. code-block:: default
 
@@ -672,16 +685,16 @@ write to C when all the block results are ready.
 
  .. code-block:: none
 
-    Opt5: 0.113246
+    Opt5: 0.111212
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 336-337
+.. GENERATED FROM PYTHON SOURCE LINES 342-343
 
 Here is the generated IR after introducing the write cache.
 
-.. GENERATED FROM PYTHON SOURCE LINES 337-340
+.. GENERATED FROM PYTHON SOURCE LINES 343-346
 
 .. code-block:: default
 
@@ -744,13 +757,13 @@ Here is the generated IR after blocking.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 341-344
+.. GENERATED FROM PYTHON SOURCE LINES 347-350
 
 Parallel
 --------
 Furthermore, we can also utilize multi-core processors for thread-level parallelization.
 
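Continuing the write-cache sketch above, parallelization is only a few more
scheduling calls (a fragment, not a standalone program):

.. code-block:: python

    # Distribute the outer row-block loop of C across CPU threads.
    s[C].parallel(mo)
    # The packing stage is embarrassingly parallel as well.
    bigN, _, littleN = s[packedB].op.axis
    s[packedB].vectorize(littleN)
    s[packedB].parallel(bigN)
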
-.. GENERATED FROM PYTHON SOURCE LINES 344-379
+.. GENERATED FROM PYTHON SOURCE LINES 350-385
 
 .. code-block:: default
 
@@ -797,16 +810,16 @@ Futhermore, we can also utilize multi-core processors to do the thread-level par
 
  .. code-block:: none
 
-    Opt6: 0.148467
+    Opt6: 0.145143
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 380-381
+.. GENERATED FROM PYTHON SOURCE LINES 386-387
 
 Here is the generated IR after parallelization.
 
-.. GENERATED FROM PYTHON SOURCE LINES 381-384
+.. GENERATED FROM PYTHON SOURCE LINES 387-390
 
 .. code-block:: default
 
@@ -869,7 +882,7 @@ Here is the generated IR after parallelization.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 387-394
+.. GENERATED FROM PYTHON SOURCE LINES 393-400
 
 Summary
 -------
diff --git a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
index cb40424ab..7576995e6 100644
--- a/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/optimize_operators/sg_execution_times.rst.txt
@@ -5,12 +5,12 @@
 
 Computation times
 =================
-**00:35.408** total execution time for **how_to_optimize_operators** files:
+**00:34.509** total execution time for **how_to_optimize_operators** files:
 
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:32.934 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_gemm.py` (``opt_gemm.py``)                       | 00:32.166 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.372 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_tensorcore.py` (``opt_conv_tensorcore.py``) | 00:01.305 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.102 | 0.0 MB |
+| :ref:`sphx_glr_how_to_optimize_operators_opt_conv_cuda.py` (``opt_conv_cuda.py``)             | 00:01.038 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
index d78e2365c..0c13e6ed4 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/sg_execution_times.rst.txt
@@ -5,18 +5,18 @@
 
 Computation times
 =================
-**05:21.415** total execution time for **how_to_tune_with_autoscheduler** files:
+**05:27.877** total execution time for **how_to_tune_with_autoscheduler** files:
 
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 02:42.181 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``) | 02:46.115 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:20.647 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_x86.py` (``tune_network_x86.py``)             | 01:21.651 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 00:43.383 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py` (``tune_network_cuda.py``)           | 00:44.027 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:17.715 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_sparse_x86.py` (``tune_sparse_x86.py``)               | 00:18.160 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:08.905 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_mali.py` (``tune_network_mali.py``)           | 00:09.070 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:08.584 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autoscheduler_tune_network_arm.py` (``tune_network_arm.py``)             | 00:08.854 | 0.0 MB |
 +----------------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
index bb715b2fd..c7dedb62e 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_conv2d_layer_cuda.rst.txt
@@ -38,11 +38,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in an :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 39-47
+.. GENERATED FROM PYTHON SOURCE LINES 39-48
 
 .. code-block:: default
 
 
+
     import os
 
     import numpy as np
@@ -57,7 +58,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-53
+.. GENERATED FROM PYTHON SOURCE LINES 54-59
 
 Define the computation
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -65,7 +66,7 @@ To begin with, let us define the computation of a convolution layer.
 The function should return the list of input/output tensors.
 From these tensors, the auto-scheduler can get the whole computational graph.
 
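A sketch of such a workload definition, mirroring the shapes this tutorial
uses (the name ``conv2d_layer`` and its argument order are assumptions):

.. code-block:: python

    import tvm
    from tvm import te, topi, auto_scheduler

    @auto_scheduler.register_workload  # register the function by name
    def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
        data = te.placeholder((N, CI, H, W), name="data")
        kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
        bias = te.placeholder((1, CO, 1, 1), name="bias")
        conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
        out = topi.nn.relu(conv + bias)
        return [data, kernel, bias, out]
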
-.. GENERATED FROM PYTHON SOURCE LINES 53-65
+.. GENERATED FROM PYTHON SOURCE LINES 59-71
 
 .. code-block:: default
 
@@ -88,13 +89,13 @@ From these tensors, the auto-scheduler can get the whole computational graph.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 66-69
+.. GENERATED FROM PYTHON SOURCE LINES 72-75
 
 Create the search task
 ^^^^^^^^^^^^^^^^^^^^^^
 We then create a search task for the last convolution layer in ResNet.
 
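A sketch of the task creation, assuming the ``conv2d_layer`` workload
registered above; the argument values match the IR shown further down
(512 input and output channels, a 7 x 7 feature map, and a 3 x 3 kernel):

.. code-block:: python

    target = tvm.target.Target("cuda")
    N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3
    task = auto_scheduler.SearchTask(
        func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, (1, 1), (1, 1)), target=target
    )
    print(task.compute_dag)  # inspect the computational graph
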
-.. GENERATED FROM PYTHON SOURCE LINES 69-82
+.. GENERATED FROM PYTHON SOURCE LINES 75-88
 
 .. code-block:: default
 
@@ -132,7 +133,7 @@ We then create a search task for the last convolution layer in the resnet.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 83-100
+.. GENERATED FROM PYTHON SOURCE LINES 89-106
 
 Next, we set parameters for the auto-scheduler. These parameters
 mainly specify how we do the measurement during the search.
@@ -152,7 +153,7 @@ mainly specify how we do the measurement during the search.
 * see :any:`auto_scheduler.TuningOptions`,
   :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters; a minimal sketch follows below.
 
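A minimal sketch of these measurement options (the file name ``conv2d.json``
and the tiny trial count are illustrative):

.. code-block:: python

    log_file = "conv2d.json"
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=10,  # a demo value; real tuning wants ~1000 trials
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        verbose=2,
    )
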
-.. GENERATED FROM PYTHON SOURCE LINES 100-110
+.. GENERATED FROM PYTHON SOURCE LINES 106-116
 
 .. code-block:: default
 
@@ -179,7 +180,7 @@ mainly specify how we do the measurement during the search.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-117
+.. GENERATED FROM PYTHON SOURCE LINES 117-123
 
 Run the search
 ^^^^^^^^^^^^^^
@@ -188,7 +189,7 @@ We can kick off the search and let the auto-scheduler do its magic.
 After some measurement trials, we can load the best schedule from the log
 file and apply it.
 
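The whole search then reduces to two calls (assuming ``task``,
``tune_option``, ``measure_ctx``, and ``log_file`` from the sketches above):

.. code-block:: python

    task.tune(tune_option)                 # run the search; this is the slow part
    sch, args = task.apply_best(log_file)  # load the best record and apply it
    del measure_ctx                        # shut down the measurement server
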
-.. GENERATED FROM PYTHON SOURCE LINES 117-126
+.. GENERATED FROM PYTHON SOURCE LINES 123-132
 
 .. code-block:: default
 
@@ -208,13 +209,13 @@ file and apply it.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 127-130
+.. GENERATED FROM PYTHON SOURCE LINES 133-136
 
 We can lower the schedule to see the IR after auto-scheduling.
 The auto-scheduler correctly performs optimizations including multi-level tiling,
 cooperative fetching, unrolling and operator fusion.
 
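For example, assuming ``sch`` and ``args`` from the previous step:

.. code-block:: python

    print(tvm.lower(sch, args, simple_mode=True))
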
-.. GENERATED FROM PYTHON SOURCE LINES 130-134
+.. GENERATED FROM PYTHON SOURCE LINES 136-140
 
 .. code-block:: default
 
@@ -241,627 +242,142 @@ cooperative fetching, unrolling and operator fusion.
       preflattened_buffer_map = {data_1: data_3: Buffer(data_2, float32, [1, 512, 7, 7], []), kernel_1: kernel_3: Buffer(kernel_2, float32, [512, 512, 3, 3], []), bias_1: bias_3: Buffer(bias_2, float32, [1, 512, 1, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [1, 512, 7, 7], [])} {
       attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 64;
       allocate(conv2d_nchw: Pointer(local float32), float32, [8]), storage_scope = local;
-      allocate(pad_temp.shared: Pointer(shared float32), float32, [2592]), storage_scope = shared;
-      allocate(kernel.shared: Pointer(shared float32), float32, [2304]), storage_scope = shared;
+      allocate(pad_temp.shared: Pointer(shared float32), float32, [1008]), storage_scope = shared;
+      allocate(kernel.shared: Pointer(shared float32), float32, [384]), storage_scope = shared;
       attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49 {
-        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [16], [], scope="local", align=16)[0] = 0f32
-        conv2d_nchw_1[4] = 0f32
+        conv2d_nchw_1: Buffer(conv2d_nchw, float32, [8], [], scope="local", align=32)[0] = 0f32
         conv2d_nchw_1[1] = 0f32
-        conv2d_nchw_1[5] = 0f32
         conv2d_nchw_1[2] = 0f32
-        conv2d_nchw_1[6] = 0f32
         conv2d_nchw_1[3] = 0f32
+        conv2d_nchw_1[4] = 0f32
+        conv2d_nchw_1[5] = 0f32
+        conv2d_nchw_1[6] = 0f32
         conv2d_nchw_1[7] = 0f32
-        for (rc.outer.outer: int32, 0, 16) {
-          let cse_var_2: int32 = (rc.outer.outer*1568)
-          let cse_var_1: int32 = (rc.outer.outer*288)
-           {
-            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1: Buffer(pad_temp.shared, float32, [2592], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((9 <= threadIdx.x_1) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[(((cse_var_2 + (floordiv(threadIdx.x_1, 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 49)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 49), 81)) && (floormod((threadIdx.x_1 + 49), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 49), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 49), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 98)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 8), 9)) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 98), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 17), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 147)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 66), 81)) && (floormod((threadIdx.x_1 + 66), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 147), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 66), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 34), 81)) && (floormod((threadIdx.x_1 + 34), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 196), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 34), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 245)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 2), 81)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 245), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 2), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 294)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 51), 81)) && (floormod((threadIdx.x_1 + 51), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 294), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 51), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 343)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 343), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 19), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 68), 81)) && (floormod((threadIdx.x_1 + 68), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 392), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 68), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 441)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 4), 9)) && (floormod((threadIdx.x_1 + 36), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 441), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 4), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 490)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 4), 81)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 490), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 4), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 539)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 53), 81)) && (floormod((threadIdx.x_1 + 53), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 539), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 53), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 588), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 21), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 637)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 70), 81)) && (floormod((threadIdx.x_1 + 70), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 637), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 70), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 686)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 38), 81)) && (floormod((threadIdx.x_1 + 38), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 686), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 38), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 735)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 6), 81)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 735), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 6), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 55), 81)) && (floormod((threadIdx.x_1 + 55), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 784), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 55), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 833)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 833), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 23), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 882)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 8), 9)) && (floormod((threadIdx.x_1 + 72), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 882), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 8), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 931)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 40), 81)) && (floormod((threadIdx.x_1 + 40), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 931), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 40), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 8), 81)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 980), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 8), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1029)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 57), 81)) && (floormod((threadIdx.x_1 + 57), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1029), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 57), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1078)] = @tir.if_then_else((((threadIdx.x_1 < 47) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1078), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 25), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1127)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 74), 81)) && (floormod((threadIdx.x_1 + 74), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1127), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 74), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1176)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 42), 81)) && (floormod((threadIdx.x_1 + 42), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1176), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 42), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1225)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 1), 9)) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1225), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 10), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1274)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 59), 81)) && (floormod((threadIdx.x_1 + 59), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1274), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 59), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1323)] = @tir.if_then_else((((threadIdx.x_1 < 45) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1323), 81)*49)) + ((floordiv(threadIdx.x_1, 9) + 3)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1372)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 76), 81)) && (floormod((threadIdx.x_1 + 76), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1372), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 76), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1421)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 44), 81)) && (floormod((threadIdx.x_1 + 44), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1421), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 44), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1470)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 3), 9)) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1470), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 12), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1519)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 61), 81)) && (floormod((threadIdx.x_1 + 61), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1519), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 61), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1568)] = @tir.if_then_else((((threadIdx.x_1 < 43) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1568), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 29), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1617)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 78), 81)) && (floormod((threadIdx.x_1 + 78), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1617), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 78), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1666)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 46), 81)) && (floormod((threadIdx.x_1 + 46), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1666), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 46), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1715)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 5), 9)) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1715), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 14), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1764)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 9) + 7), 9)) && (floormod((threadIdx.x_1 + 63), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1764), 81)*49)) + (floormod((floordiv(threadIdx.x_1, 9) + 7), 9)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1813)] = @tir.if_then_else((((threadIdx.x_1 < 41) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1813), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 31), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1862)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 80), 81)) && (floormod((threadIdx.x_1 + 80), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1862), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 80), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1911)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 48), 81)) && (floormod((threadIdx.x_1 + 48), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1911), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 48), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 1960)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 7), 9)) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 1960), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 16), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2009)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 65), 81)) && (floormod((threadIdx.x_1 + 65), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2009), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 65), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2058)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 33), 81)) && (floormod((threadIdx.x_1 + 33), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2058), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 33), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2107)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 1), 81)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2107), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 1), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2156)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 50), 81)) && (floormod((threadIdx.x_1 + 50), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2156), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 50), 81), 9)*7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2205)] = @tir.if_then_else(((1 <= floormod(threadIdx.x_1, 9)) && (floormod(threadIdx.x_1, 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2205), 81)*49)) + ((floordiv(threadIdx.x_1, 9) + 2)*7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2254)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 67), 81)) && (floormod((threadIdx.x_1 + 67), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2254), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 67), 81), 9)*7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2303)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 35), 81)) && (floormod((threadIdx.x_1 + 35), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2303), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 35), 81), 9)*7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2352)] = @tir.if_then_else((((9 <= floormod((threadIdx.x_1 + 3), 81)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2352), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 3), 81), 9)*7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2401)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 52), 81)) && (floormod((threadIdx.x_1 + 52), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2401), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 52), 81), 9)*7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2450)] = @tir.if_then_else(((1 <= floormod((threadIdx.x_1 + 2), 9)) && (floormod((threadIdx.x_1 + 2), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2450), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 20), 81), 9)*7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            pad_temp.shared_1[(threadIdx.x_1 + 2499)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 69), 81)) && (floormod((threadIdx.x_1 + 69), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2499), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 69), 81), 9)*7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
-            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            if @tir.likely((threadIdx.x_1 < 44), dtype=bool) {
-              pad_temp.shared_1[(threadIdx.x_1 + 2548)] = @tir.if_then_else((((threadIdx.x_1 < 35) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), data[((((cse_var_2 + (floordiv((threadIdx.x_1 + 2548), 81)*49)) + (floordiv(floormod((threadIdx.x_1 + 37), 81), 9)*7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
-            }
-            attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1: Buffer(kernel.shared, float32, [2304], [], scope="shared")[threadIdx.x_2] = kernel[(((blockIdx.x*36864) + cse_var_1) + threadIdx.x_2)]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 49)] = kernel[((((blockIdx.x*36864) + cse_var_1) + (floordiv((threadIdx.x_2 + 49), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 98)] = kernel[((((blockIdx.x*36864) + cse_var_1) + (floordiv((threadIdx.x_2 + 98), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 147)] = kernel[((((blockIdx.x*36864) + cse_var_1) + threadIdx.x_2) + 147)]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 196)] = kernel[((((blockIdx.x*36864) + cse_var_1) + (floordiv((threadIdx.x_2 + 196), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 245)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 245), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 245), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 294)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 294), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 2)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 343)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 343), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 55), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 392)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 392), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 104), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 441)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 441), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 51)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 490)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 490), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 202), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 539)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 539), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 251), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 588)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 588), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 4)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 637)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 637), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 61), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 686)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 686), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 110), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 735)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 735), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 53)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 784)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 784), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 208), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 833)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 833), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 257), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 882)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 882), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 6)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 931)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 931), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 67), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 980)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 980), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 116), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1029)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1029), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 55)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1078)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1078), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 214), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1127)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1127), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 263), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1176)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1176), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 8)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1225)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1225), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 73), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1274)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1274), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 122), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1323)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1323), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 57)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1372)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1372), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 220), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1421)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1421), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 269), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1470)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1470), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 10)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1519)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1519), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 79), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1568)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1568), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 128), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1617)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1617), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 59)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1666)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1666), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 226), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1715)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1715), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 275), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1764)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1764), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 12)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1813)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1813), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 85), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1862)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1862), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 134), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1911)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1911), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 61)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 1960)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1960), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 232), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 2009)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2009), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 281), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 2058)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2058), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 14)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 2107)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2107), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 91), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 2156)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2156), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 140), 288), 3)*3)) + floormod((threadIdx.x_2 + 2), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 2205)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2205), 288)*4608)) + cse_var_1) + ((floordiv(threadIdx.x_2, 3) + 63)*3)) + floormod(threadIdx.x_2, 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            kernel.shared_1[(threadIdx.x_2 + 2254)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2254), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 238), 288), 3)*3)) + floormod((threadIdx.x_2 + 1), 3))]
-            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
-            if @tir.likely((threadIdx.x_2 < 1), dtype=bool) {
-              kernel.shared_1[(threadIdx.x_2 + 2303)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 2303), 288)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 287), 288), 3)*3)) + (threadIdx.x_2 + 2))]
-            }
-            for (rx.outer.inner: int32, 0, 3) {
-              for (ff.outer.inner: int32, 0, 2) {
-                let cse_var_7: int32 = (ff.outer.inner*2)
-                let cse_var_6: int32 = ((ff.outer.inner*576) + rx.outer.inner)
-                let cse_var_5: int32 = (cse_var_7 + 5)
-                let cse_var_4: int32 = (cse_var_7 + 4)
-                let cse_var_3: int32 = (cse_var_7 + 1)
+        for (rc.outer.outer: int32, 0, 32) {
+          for (rx.outer.outer: int32, 0, 3) {
+            let cse_var_2: int32 = (rc.outer.outer*784)
+            let cse_var_1: int32 = (rc.outer.outer*144)
+             {
+              attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1: Buffer(pad_temp.shared, float32, [1008], [], scope="shared")[threadIdx.x_1] = @tir.if_then_else((((7 <= threadIdx.x_1) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((cse_var_2 + threadIdx.x_1) + rx.outer.outer) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 49)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 7), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 7), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 49), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 7), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 98)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 5), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 5), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 98), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 5), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 147)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 3), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 3), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 147), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 3), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 196)] = @tir.if_then_else(((1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 196), 63)*49)) + ((floordiv(threadIdx.x_1, 7) + 1)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 245)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 245), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 294)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 294), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 343)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 343), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 392)] = @tir.if_then_else((((threadIdx.x_1 < 42) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 392), 63)*49)) + ((floordiv(threadIdx.x_1, 7) + 2)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 441)] = @tir.if_then_else((((7 <= threadIdx.x_1) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((cse_var_2 + threadIdx.x_1) + rx.outer.outer) + 335)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 490)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 7), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 7), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 490), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 7), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 539)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 5), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 5), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 539), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 5), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 588)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 3), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 3), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 588), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 3), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 637)] = @tir.if_then_else(((1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 637), 63)*49)) + ((floordiv(threadIdx.x_1, 7) + 1)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 686)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 8), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 8), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 686), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 8), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 735)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 6), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 6), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 735), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 6), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 4), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 4), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 784), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 4), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 833)] = @tir.if_then_else((((threadIdx.x_1 < 42) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 833), 63)*49)) + ((floordiv(threadIdx.x_1, 7) + 2)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 882)] = @tir.if_then_else((((7 <= threadIdx.x_1) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((cse_var_2 + threadIdx.x_1) + rx.outer.outer) + 678)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              pad_temp.shared_1[(threadIdx.x_1 + 931)] = @tir.if_then_else(((((1 <= floormod((floordiv(threadIdx.x_1, 7) + 7), 9)) && (floormod((floordiv(threadIdx.x_1, 7) + 7), 9) < 8)) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 931), 63)*49)) + (floormod((floordiv(threadIdx.x_1, 7) + 7), 9)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              if @tir.likely((threadIdx.x_1 < 28), dtype=bool) {
+                pad_temp.shared_1[(threadIdx.x_1 + 980)] = @tir.if_then_else((((threadIdx.x_1 < 21) && (1 <= (rx.outer.outer + floormod(threadIdx.x_1, 7)))) && ((rx.outer.outer + floormod(threadIdx.x_1, 7)) < 8)), data[(((((cse_var_2 + (floordiv((threadIdx.x_1 + 980), 63)*49)) + ((floordiv(threadIdx.x_1, 7) + 5)*7)) + rx.outer.outer) + floormod(threadIdx.x_1, 7)) - 8)], 0f32, dtype=float32)
+              }
+              attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1: Buffer(kernel.shared, float32, [384], [], scope="shared")[threadIdx.x_2] = kernel[(((((blockIdx.x*36864) + (floordiv(threadIdx.x_2, 48)*4608)) + cse_var_1) + (floormod(threadIdx.x_2, 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1[(threadIdx.x_2 + 49)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 49), 48)*4608)) + cse_var_1) + (floormod((threadIdx.x_2 + 1), 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1[(threadIdx.x_2 + 98)] = kernel[(((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 98), 48)*4608)) + cse_var_1) + (floormod((threadIdx.x_2 + 2), 48)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1[(threadIdx.x_2 + 147)] = kernel[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 147), 48)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 1), 16)*9)) + (floormod(threadIdx.x_2, 3)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1[(threadIdx.x_2 + 196)] = kernel[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 196), 48)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 4), 48), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1[(threadIdx.x_2 + 245)] = kernel[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 245), 48)*4608)) + cse_var_1) + (floordiv(floormod((threadIdx.x_2 + 5), 48), 3)*9)) + (floormod((threadIdx.x_2 + 2), 3)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              kernel.shared_1[(threadIdx.x_2 + 294)] = kernel[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 294), 48)*4608)) + cse_var_1) + (floormod((floordiv(threadIdx.x_2, 3) + 2), 16)*9)) + (floormod(threadIdx.x_2, 3)*3)) + rx.outer.outer)]
+              attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 49;
+              if @tir.likely((threadIdx.x_2 < 41), dtype=bool) {
+                kernel.shared_1[(threadIdx.x_2 + 343)] = kernel[((((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 343), 48)*4608)) + cse_var_1) + (floordiv((threadIdx.x_2 + 7), 3)*9)) + (floormod((threadIdx.x_2 + 1), 3)*3)) + rx.outer.outer)]
+              }
+              for (rc.outer.inner: int32, 0, 8) {
+                let cse_var_3: int32 = (rc.outer.inner*6)
                  {
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[(((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7))]*kernel.shared_1[cse_var_6]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[(((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_6 + 1152)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[(((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_6 + 288)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[(((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7))]*kernel.shared_1[(cse_var_6 + 1440)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_6 + 3)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_6 + 1155)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_6 + 291)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 9)]*kernel.shared_1[(cse_var_6 + 1443)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_6 + 6)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_6 + 1158)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_6 + 294)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 18)]*kernel.shared_1[(cse_var_6 + 1446)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_6 + 9)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_6 + 1161)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_6 + 297)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 81)]*kernel.shared_1[(cse_var_6 + 1449)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_6 + 12)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_6 + 1164)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_6 + 300)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 90)]*kernel.shared_1[(cse_var_6 + 1452)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_6 + 15)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_6 + 1167)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_6 + 303)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 99)]*kernel.shared_1[(cse_var_6 + 1455)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_6 + 18)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_6 + 1170)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_6 + 306)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 162)]*kernel.shared_1[(cse_var_6 + 1458)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_6 + 21)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_6 + 1173)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_6 + 309)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 171)]*kernel.shared_1[(cse_var_6 + 1461)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_6 + 24)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_6 + 1176)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_6 + 312)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 180)]*kernel.shared_1[(cse_var_6 + 1464)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_6 + 27)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_6 + 1179)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_6 + 315)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 243)]*kernel.shared_1[(cse_var_6 + 1467)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_6 + 30)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_6 + 1182)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_6 + 318)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 252)]*kernel.shared_1[(cse_var_6 + 1470)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_6 + 33)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_6 + 1185)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_6 + 321)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 261)]*kernel.shared_1[(cse_var_6 + 1473)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_6 + 36)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_6 + 1188)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_6 + 324)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 324)]*kernel.shared_1[(cse_var_6 + 1476)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_6 + 39)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_6 + 1191)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_6 + 327)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 333)]*kernel.shared_1[(cse_var_6 + 1479)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_6 + 42)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_6 + 1194)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_6 + 330)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 342)]*kernel.shared_1[(cse_var_6 + 1482)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_6 + 45)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_6 + 1197)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_6 + 333)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 405)]*kernel.shared_1[(cse_var_6 + 1485)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_6 + 48)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_6 + 1200)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_6 + 336)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 414)]*kernel.shared_1[(cse_var_6 + 1488)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_6 + 51)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_6 + 1203)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_6 + 339)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 423)]*kernel.shared_1[(cse_var_6 + 1491)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_6 + 54)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_6 + 1206)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_6 + 342)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 486)]*kernel.shared_1[(cse_var_6 + 1494)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_6 + 57)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_6 + 1209)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_6 + 345)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 495)]*kernel.shared_1[(cse_var_6 + 1497)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_6 + 60)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_6 + 1212)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_6 + 348)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 504)]*kernel.shared_1[(cse_var_6 + 1500)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_6 + 63)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_6 + 1215)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_6 + 351)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 567)]*kernel.shared_1[(cse_var_6 + 1503)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_6 + 66)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_6 + 1218)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_6 + 354)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 576)]*kernel.shared_1[(cse_var_6 + 1506)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_6 + 69)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_6 + 1221)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_6 + 357)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 585)]*kernel.shared_1[(cse_var_6 + 1509)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 648)]*kernel.shared_1[(cse_var_6 + 72)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 648)]*kernel.shared_1[(cse_var_6 + 1224)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 648)]*kernel.shared_1[(cse_var_6 + 360)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 648)]*kernel.shared_1[(cse_var_6 + 1512)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 657)]*kernel.shared_1[(cse_var_6 + 75)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 657)]*kernel.shared_1[(cse_var_6 + 1227)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 657)]*kernel.shared_1[(cse_var_6 + 363)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 657)]*kernel.shared_1[(cse_var_6 + 1515)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 666)]*kernel.shared_1[(cse_var_6 + 78)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 666)]*kernel.shared_1[(cse_var_6 + 1230)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 666)]*kernel.shared_1[(cse_var_6 + 366)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 666)]*kernel.shared_1[(cse_var_6 + 1518)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 729)]*kernel.shared_1[(cse_var_6 + 81)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 729)]*kernel.shared_1[(cse_var_6 + 1233)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 729)]*kernel.shared_1[(cse_var_6 + 369)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 729)]*kernel.shared_1[(cse_var_6 + 1521)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 738)]*kernel.shared_1[(cse_var_6 + 84)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 738)]*kernel.shared_1[(cse_var_6 + 1236)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 738)]*kernel.shared_1[(cse_var_6 + 372)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 738)]*kernel.shared_1[(cse_var_6 + 1524)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 747)]*kernel.shared_1[(cse_var_6 + 87)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 747)]*kernel.shared_1[(cse_var_6 + 1239)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 747)]*kernel.shared_1[(cse_var_6 + 375)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 747)]*kernel.shared_1[(cse_var_6 + 1527)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 810)]*kernel.shared_1[(cse_var_6 + 90)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 810)]*kernel.shared_1[(cse_var_6 + 1242)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 810)]*kernel.shared_1[(cse_var_6 + 378)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 810)]*kernel.shared_1[(cse_var_6 + 1530)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 819)]*kernel.shared_1[(cse_var_6 + 93)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 819)]*kernel.shared_1[(cse_var_6 + 1245)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 819)]*kernel.shared_1[(cse_var_6 + 381)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 819)]*kernel.shared_1[(cse_var_6 + 1533)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 828)]*kernel.shared_1[(cse_var_6 + 96)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 828)]*kernel.shared_1[(cse_var_6 + 1248)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 828)]*kernel.shared_1[(cse_var_6 + 384)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 828)]*kernel.shared_1[(cse_var_6 + 1536)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 891)]*kernel.shared_1[(cse_var_6 + 99)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 891)]*kernel.shared_1[(cse_var_6 + 1251)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 891)]*kernel.shared_1[(cse_var_6 + 387)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 891)]*kernel.shared_1[(cse_var_6 + 1539)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 900)]*kernel.shared_1[(cse_var_6 + 102)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 900)]*kernel.shared_1[(cse_var_6 + 1254)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 900)]*kernel.shared_1[(cse_var_6 + 390)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 900)]*kernel.shared_1[(cse_var_6 + 1542)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 909)]*kernel.shared_1[(cse_var_6 + 105)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 909)]*kernel.shared_1[(cse_var_6 + 1257)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 909)]*kernel.shared_1[(cse_var_6 + 393)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 909)]*kernel.shared_1[(cse_var_6 + 1545)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 972)]*kernel.shared_1[(cse_var_6 + 108)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 972)]*kernel.shared_1[(cse_var_6 + 1260)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 972)]*kernel.shared_1[(cse_var_6 + 396)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 972)]*kernel.shared_1[(cse_var_6 + 1548)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 981)]*kernel.shared_1[(cse_var_6 + 111)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 981)]*kernel.shared_1[(cse_var_6 + 1263)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 981)]*kernel.shared_1[(cse_var_6 + 399)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 981)]*kernel.shared_1[(cse_var_6 + 1551)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 990)]*kernel.shared_1[(cse_var_6 + 114)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 990)]*kernel.shared_1[(cse_var_6 + 1266)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 990)]*kernel.shared_1[(cse_var_6 + 402)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 990)]*kernel.shared_1[(cse_var_6 + 1554)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1053)]*kernel.shared_1[(cse_var_6 + 117)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1053)]*kernel.shared_1[(cse_var_6 + 1269)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1053)]*kernel.shared_1[(cse_var_6 + 405)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1053)]*kernel.shared_1[(cse_var_6 + 1557)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1062)]*kernel.shared_1[(cse_var_6 + 120)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1062)]*kernel.shared_1[(cse_var_6 + 1272)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1062)]*kernel.shared_1[(cse_var_6 + 408)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1062)]*kernel.shared_1[(cse_var_6 + 1560)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1071)]*kernel.shared_1[(cse_var_6 + 123)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1071)]*kernel.shared_1[(cse_var_6 + 1275)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1071)]*kernel.shared_1[(cse_var_6 + 411)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1071)]*kernel.shared_1[(cse_var_6 + 1563)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1134)]*kernel.shared_1[(cse_var_6 + 126)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1134)]*kernel.shared_1[(cse_var_6 + 1278)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1134)]*kernel.shared_1[(cse_var_6 + 414)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1134)]*kernel.shared_1[(cse_var_6 + 1566)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1143)]*kernel.shared_1[(cse_var_6 + 129)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1143)]*kernel.shared_1[(cse_var_6 + 1281)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1143)]*kernel.shared_1[(cse_var_6 + 417)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1143)]*kernel.shared_1[(cse_var_6 + 1569)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1152)]*kernel.shared_1[(cse_var_6 + 132)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1152)]*kernel.shared_1[(cse_var_6 + 1284)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1152)]*kernel.shared_1[(cse_var_6 + 420)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1152)]*kernel.shared_1[(cse_var_6 + 1572)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1215)]*kernel.shared_1[(cse_var_6 + 135)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1215)]*kernel.shared_1[(cse_var_6 + 1287)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1215)]*kernel.shared_1[(cse_var_6 + 423)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1215)]*kernel.shared_1[(cse_var_6 + 1575)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1224)]*kernel.shared_1[(cse_var_6 + 138)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1224)]*kernel.shared_1[(cse_var_6 + 1290)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1224)]*kernel.shared_1[(cse_var_6 + 426)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1224)]*kernel.shared_1[(cse_var_6 + 1578)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1233)]*kernel.shared_1[(cse_var_6 + 141)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1233)]*kernel.shared_1[(cse_var_6 + 1293)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1233)]*kernel.shared_1[(cse_var_6 + 429)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1233)]*kernel.shared_1[(cse_var_6 + 1581)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1296)]*kernel.shared_1[(cse_var_6 + 144)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1296)]*kernel.shared_1[(cse_var_6 + 1296)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1296)]*kernel.shared_1[(cse_var_6 + 432)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1296)]*kernel.shared_1[(cse_var_6 + 1584)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1305)]*kernel.shared_1[(cse_var_6 + 147)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1305)]*kernel.shared_1[(cse_var_6 + 1299)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1305)]*kernel.shared_1[(cse_var_6 + 435)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1305)]*kernel.shared_1[(cse_var_6 + 1587)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1314)]*kernel.shared_1[(cse_var_6 + 150)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1314)]*kernel.shared_1[(cse_var_6 + 1302)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1314)]*kernel.shared_1[(cse_var_6 + 438)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1314)]*kernel.shared_1[(cse_var_6 + 1590)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1377)]*kernel.shared_1[(cse_var_6 + 153)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1377)]*kernel.shared_1[(cse_var_6 + 1305)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1377)]*kernel.shared_1[(cse_var_6 + 441)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1377)]*kernel.shared_1[(cse_var_6 + 1593)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1386)]*kernel.shared_1[(cse_var_6 + 156)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1386)]*kernel.shared_1[(cse_var_6 + 1308)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1386)]*kernel.shared_1[(cse_var_6 + 444)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1386)]*kernel.shared_1[(cse_var_6 + 1596)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1395)]*kernel.shared_1[(cse_var_6 + 159)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1395)]*kernel.shared_1[(cse_var_6 + 1311)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1395)]*kernel.shared_1[(cse_var_6 + 447)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1395)]*kernel.shared_1[(cse_var_6 + 1599)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1458)]*kernel.shared_1[(cse_var_6 + 162)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1458)]*kernel.shared_1[(cse_var_6 + 1314)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1458)]*kernel.shared_1[(cse_var_6 + 450)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1458)]*kernel.shared_1[(cse_var_6 + 1602)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1467)]*kernel.shared_1[(cse_var_6 + 165)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1467)]*kernel.shared_1[(cse_var_6 + 1317)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1467)]*kernel.shared_1[(cse_var_6 + 453)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1467)]*kernel.shared_1[(cse_var_6 + 1605)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1476)]*kernel.shared_1[(cse_var_6 + 168)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1476)]*kernel.shared_1[(cse_var_6 + 1320)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1476)]*kernel.shared_1[(cse_var_6 + 456)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1476)]*kernel.shared_1[(cse_var_6 + 1608)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1539)]*kernel.shared_1[(cse_var_6 + 171)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1539)]*kernel.shared_1[(cse_var_6 + 1323)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1539)]*kernel.shared_1[(cse_var_6 + 459)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1539)]*kernel.shared_1[(cse_var_6 + 1611)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1548)]*kernel.shared_1[(cse_var_6 + 174)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1548)]*kernel.shared_1[(cse_var_6 + 1326)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1548)]*kernel.shared_1[(cse_var_6 + 462)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1548)]*kernel.shared_1[(cse_var_6 + 1614)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1557)]*kernel.shared_1[(cse_var_6 + 177)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1557)]*kernel.shared_1[(cse_var_6 + 1329)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1557)]*kernel.shared_1[(cse_var_6 + 465)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1557)]*kernel.shared_1[(cse_var_6 + 1617)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1620)]*kernel.shared_1[(cse_var_6 + 180)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1620)]*kernel.shared_1[(cse_var_6 + 1332)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1620)]*kernel.shared_1[(cse_var_6 + 468)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1620)]*kernel.shared_1[(cse_var_6 + 1620)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1629)]*kernel.shared_1[(cse_var_6 + 183)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1629)]*kernel.shared_1[(cse_var_6 + 1335)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1629)]*kernel.shared_1[(cse_var_6 + 471)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1629)]*kernel.shared_1[(cse_var_6 + 1623)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1638)]*kernel.shared_1[(cse_var_6 + 186)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1638)]*kernel.shared_1[(cse_var_6 + 1338)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1638)]*kernel.shared_1[(cse_var_6 + 474)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1638)]*kernel.shared_1[(cse_var_6 + 1626)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1701)]*kernel.shared_1[(cse_var_6 + 189)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1701)]*kernel.shared_1[(cse_var_6 + 1341)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1701)]*kernel.shared_1[(cse_var_6 + 477)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1701)]*kernel.shared_1[(cse_var_6 + 1629)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1710)]*kernel.shared_1[(cse_var_6 + 192)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1710)]*kernel.shared_1[(cse_var_6 + 1344)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1710)]*kernel.shared_1[(cse_var_6 + 480)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1710)]*kernel.shared_1[(cse_var_6 + 1632)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1719)]*kernel.shared_1[(cse_var_6 + 195)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1719)]*kernel.shared_1[(cse_var_6 + 1347)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1719)]*kernel.shared_1[(cse_var_6 + 483)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1719)]*kernel.shared_1[(cse_var_6 + 1635)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1782)]*kernel.shared_1[(cse_var_6 + 198)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1782)]*kernel.shared_1[(cse_var_6 + 1350)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1782)]*kernel.shared_1[(cse_var_6 + 486)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1782)]*kernel.shared_1[(cse_var_6 + 1638)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1791)]*kernel.shared_1[(cse_var_6 + 201)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1791)]*kernel.shared_1[(cse_var_6 + 1353)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1791)]*kernel.shared_1[(cse_var_6 + 489)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1791)]*kernel.shared_1[(cse_var_6 + 1641)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1800)]*kernel.shared_1[(cse_var_6 + 204)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1800)]*kernel.shared_1[(cse_var_6 + 1356)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1800)]*kernel.shared_1[(cse_var_6 + 492)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1800)]*kernel.shared_1[(cse_var_6 + 1644)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1863)]*kernel.shared_1[(cse_var_6 + 207)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1863)]*kernel.shared_1[(cse_var_6 + 1359)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1863)]*kernel.shared_1[(cse_var_6 + 495)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1863)]*kernel.shared_1[(cse_var_6 + 1647)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1872)]*kernel.shared_1[(cse_var_6 + 210)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1872)]*kernel.shared_1[(cse_var_6 + 1362)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1872)]*kernel.shared_1[(cse_var_6 + 498)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1872)]*kernel.shared_1[(cse_var_6 + 1650)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1881)]*kernel.shared_1[(cse_var_6 + 213)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1881)]*kernel.shared_1[(cse_var_6 + 1365)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1881)]*kernel.shared_1[(cse_var_6 + 501)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1881)]*kernel.shared_1[(cse_var_6 + 1653)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1944)]*kernel.shared_1[(cse_var_6 + 216)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1944)]*kernel.shared_1[(cse_var_6 + 1368)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1944)]*kernel.shared_1[(cse_var_6 + 504)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1944)]*kernel.shared_1[(cse_var_6 + 1656)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1953)]*kernel.shared_1[(cse_var_6 + 219)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1953)]*kernel.shared_1[(cse_var_6 + 1371)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1953)]*kernel.shared_1[(cse_var_6 + 507)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1953)]*kernel.shared_1[(cse_var_6 + 1659)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1962)]*kernel.shared_1[(cse_var_6 + 222)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1962)]*kernel.shared_1[(cse_var_6 + 1374)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1962)]*kernel.shared_1[(cse_var_6 + 510)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 1962)]*kernel.shared_1[(cse_var_6 + 1662)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2025)]*kernel.shared_1[(cse_var_6 + 225)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2025)]*kernel.shared_1[(cse_var_6 + 1377)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2025)]*kernel.shared_1[(cse_var_6 + 513)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2025)]*kernel.shared_1[(cse_var_6 + 1665)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2034)]*kernel.shared_1[(cse_var_6 + 228)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2034)]*kernel.shared_1[(cse_var_6 + 1380)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2034)]*kernel.shared_1[(cse_var_6 + 516)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2034)]*kernel.shared_1[(cse_var_6 + 1668)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2043)]*kernel.shared_1[(cse_var_6 + 231)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2043)]*kernel.shared_1[(cse_var_6 + 1383)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2043)]*kernel.shared_1[(cse_var_6 + 519)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2043)]*kernel.shared_1[(cse_var_6 + 1671)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2106)]*kernel.shared_1[(cse_var_6 + 234)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2106)]*kernel.shared_1[(cse_var_6 + 1386)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2106)]*kernel.shared_1[(cse_var_6 + 522)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2106)]*kernel.shared_1[(cse_var_6 + 1674)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2115)]*kernel.shared_1[(cse_var_6 + 237)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2115)]*kernel.shared_1[(cse_var_6 + 1389)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2115)]*kernel.shared_1[(cse_var_6 + 525)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2115)]*kernel.shared_1[(cse_var_6 + 1677)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2124)]*kernel.shared_1[(cse_var_6 + 240)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2124)]*kernel.shared_1[(cse_var_6 + 1392)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2124)]*kernel.shared_1[(cse_var_6 + 528)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2124)]*kernel.shared_1[(cse_var_6 + 1680)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2187)]*kernel.shared_1[(cse_var_6 + 243)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2187)]*kernel.shared_1[(cse_var_6 + 1395)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2187)]*kernel.shared_1[(cse_var_6 + 531)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2187)]*kernel.shared_1[(cse_var_6 + 1683)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2196)]*kernel.shared_1[(cse_var_6 + 246)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2196)]*kernel.shared_1[(cse_var_6 + 1398)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2196)]*kernel.shared_1[(cse_var_6 + 534)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2196)]*kernel.shared_1[(cse_var_6 + 1686)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2205)]*kernel.shared_1[(cse_var_6 + 249)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2205)]*kernel.shared_1[(cse_var_6 + 1401)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2205)]*kernel.shared_1[(cse_var_6 + 537)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2205)]*kernel.shared_1[(cse_var_6 + 1689)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2268)]*kernel.shared_1[(cse_var_6 + 252)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2268)]*kernel.shared_1[(cse_var_6 + 1404)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2268)]*kernel.shared_1[(cse_var_6 + 540)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2268)]*kernel.shared_1[(cse_var_6 + 1692)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2277)]*kernel.shared_1[(cse_var_6 + 255)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2277)]*kernel.shared_1[(cse_var_6 + 1407)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2277)]*kernel.shared_1[(cse_var_6 + 543)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2277)]*kernel.shared_1[(cse_var_6 + 1695)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2286)]*kernel.shared_1[(cse_var_6 + 258)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2286)]*kernel.shared_1[(cse_var_6 + 1410)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2286)]*kernel.shared_1[(cse_var_6 + 546)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2286)]*kernel.shared_1[(cse_var_6 + 1698)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2349)]*kernel.shared_1[(cse_var_6 + 261)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2349)]*kernel.shared_1[(cse_var_6 + 1413)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2349)]*kernel.shared_1[(cse_var_6 + 549)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2349)]*kernel.shared_1[(cse_var_6 + 1701)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2358)]*kernel.shared_1[(cse_var_6 + 264)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2358)]*kernel.shared_1[(cse_var_6 + 1416)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2358)]*kernel.shared_1[(cse_var_6 + 552)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2358)]*kernel.shared_1[(cse_var_6 + 1704)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2367)]*kernel.shared_1[(cse_var_6 + 267)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2367)]*kernel.shared_1[(cse_var_6 + 1419)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2367)]*kernel.shared_1[(cse_var_6 + 555)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2367)]*kernel.shared_1[(cse_var_6 + 1707)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2430)]*kernel.shared_1[(cse_var_6 + 270)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2430)]*kernel.shared_1[(cse_var_6 + 1422)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2430)]*kernel.shared_1[(cse_var_6 + 558)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2430)]*kernel.shared_1[(cse_var_6 + 1710)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2439)]*kernel.shared_1[(cse_var_6 + 273)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2439)]*kernel.shared_1[(cse_var_6 + 1425)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2439)]*kernel.shared_1[(cse_var_6 + 561)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2439)]*kernel.shared_1[(cse_var_6 + 1713)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2448)]*kernel.shared_1[(cse_var_6 + 276)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2448)]*kernel.shared_1[(cse_var_6 + 1428)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2448)]*kernel.shared_1[(cse_var_6 + 564)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2448)]*kernel.shared_1[(cse_var_6 + 1716)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2511)]*kernel.shared_1[(cse_var_6 + 279)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2511)]*kernel.shared_1[(cse_var_6 + 1431)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2511)]*kernel.shared_1[(cse_var_6 + 567)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2511)]*kernel.shared_1[(cse_var_6 + 1719)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2520)]*kernel.shared_1[(cse_var_6 + 282)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2520)]*kernel.shared_1[(cse_var_6 + 1434)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2520)]*kernel.shared_1[(cse_var_6 + 570)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2520)]*kernel.shared_1[(cse_var_6 + 1722)]))
-                  conv2d_nchw_1[cse_var_7] = (conv2d_nchw_1[cse_var_7] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2529)]*kernel.shared_1[(cse_var_6 + 285)]))
-                  conv2d_nchw_1[cse_var_4] = (conv2d_nchw_1[cse_var_4] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2529)]*kernel.shared_1[(cse_var_6 + 1437)]))
-                  conv2d_nchw_1[cse_var_3] = (conv2d_nchw_1[cse_var_3] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2529)]*kernel.shared_1[(cse_var_6 + 573)]))
-                  conv2d_nchw_1[cse_var_5] = (conv2d_nchw_1[cse_var_5] + (pad_temp.shared_1[((((floordiv(threadIdx.x, 7)*9) + rx.outer.inner) + floormod(threadIdx.x, 7)) + 2529)]*kernel.shared_1[(cse_var_6 + 1725)]))
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[cse_var_3]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 48)]))
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 3)]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 51)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 96)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 144)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 99)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 147)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 192)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 240)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 195)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 243)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 288)]))
+                  conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[((rc.outer.inner*126) + threadIdx.x)]*kernel.shared_1[(cse_var_3 + 336)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 291)]))
+                  conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 63)]*kernel.shared_1[(cse_var_3 + 339)]))
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 1)]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 49)]))
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 4)]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 52)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 97)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 145)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 100)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 148)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 193)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 241)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 196)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 244)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 289)]))
+                  conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 7)]*kernel.shared_1[(cse_var_3 + 337)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 292)]))
+                  conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 70)]*kernel.shared_1[(cse_var_3 + 340)]))
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 2)]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 50)]))
+                  conv2d_nchw_1[0] = (conv2d_nchw_1[0] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 5)]))
+                  conv2d_nchw_1[1] = (conv2d_nchw_1[1] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 53)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 98)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 146)]))
+                  conv2d_nchw_1[2] = (conv2d_nchw_1[2] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 101)]))
+                  conv2d_nchw_1[3] = (conv2d_nchw_1[3] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 149)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 194)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 242)]))
+                  conv2d_nchw_1[4] = (conv2d_nchw_1[4] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 197)]))
+                  conv2d_nchw_1[5] = (conv2d_nchw_1[5] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 245)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 290)]))
+                  conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 14)]*kernel.shared_1[(cse_var_3 + 338)]))
+                  conv2d_nchw_1[6] = (conv2d_nchw_1[6] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 293)]))
+                  conv2d_nchw_1[7] = (conv2d_nchw_1[7] + (pad_temp.shared_1[(((rc.outer.inner*126) + threadIdx.x) + 77)]*kernel.shared_1[(cse_var_3 + 341)]))
                 }
               }
             }
           }
         }
-        for (i1.inner: int32, 0, 4) {
+        for (i1.inner: int32, 0, 8) {
           compute[(((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x)] = max((conv2d_nchw_1[i1.inner] + bias[((blockIdx.x*8) + i1.inner)]), 0f32)
-          compute[((((blockIdx.x*392) + (i1.inner*49)) + threadIdx.x) + 196)] = max((conv2d_nchw_1[(i1.inner + 4)] + bias[(((blockIdx.x*8) + i1.inner) + 4)]), 0f32)
         }
       }
     }
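
For reference, the final ``for (i1.inner, 0, 8)`` loop above is the fused
epilogue produced by operator fusion: each of the eight output channels
accumulated in ``conv2d_nchw_1`` gets its bias added and is clamped at zero
(ReLU) before being written to ``compute``. A minimal NumPy sketch of that
computation (the array names here are illustrative, not part of the generated
kernel):

.. code-block:: python

    import numpy as np

    # Hypothetical stand-ins for the per-thread accumulators and per-channel bias.
    conv = np.random.rand(8, 49).astype("float32")  # 8 output channels, 7x7 spatial
    bias = np.random.rand(8).astype("float32")

    # Bias add followed by ReLU, the same pattern as the i1.inner loop above.
    compute = np.maximum(conv + bias[:, None], 0.0)
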
@@ -871,13 +387,13 @@ cooperative fetching, unrolling and operator fusion.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 135-138
+.. GENERATED FROM PYTHON SOURCE LINES 141-144
 
 Check correctness and evaluate performance
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 We build the binary and check its correctness and performance.
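
The code block below is collapsed in this diff; in condensed form it does
roughly the following. This is a sketch, assuming the ``task``, ``target``,
and ``log_file`` objects created earlier in the tutorial, with the tutorial's
workload (the last convolution layer of ResNet-50:
N, H, W, CO, CI, KH, KW = 1, 7, 7, 512, 512, 3, 3, stride 1, padding 1):

.. code-block:: python

    import numpy as np
    import tvm
    from tvm.topi.testing import conv2d_nchw_python

    sch, args = task.apply_best(log_file)  # re-apply the best schedule found
    func = tvm.build(sch, args, target)

    # Random inputs and a NumPy reference for conv + bias + ReLU.
    data_np = np.random.uniform(size=(1, 512, 7, 7)).astype(np.float32)
    weight_np = np.random.uniform(size=(512, 512, 3, 3)).astype(np.float32)
    bias_np = np.random.uniform(size=(1, 512, 1, 1)).astype(np.float32)
    conv_np = conv2d_nchw_python(data_np, weight_np, 1, 1)  # stride 1, padding 1
    out_np = np.maximum(conv_np + bias_np, 0.0)

    dev = tvm.cuda()
    data_tvm = tvm.nd.array(data_np, device=dev)
    weight_tvm = tvm.nd.array(weight_np, device=dev)
    bias_tvm = tvm.nd.array(bias_np, device=dev)
    out_tvm = tvm.nd.empty(out_np.shape, device=dev)
    func(data_tvm, weight_tvm, bias_tvm, out_tvm)

    # Check correctness, then time the kernel.
    np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)
    evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
    ms = np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000
    print("Execution time of this operator: %.3f ms" % ms)
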
 
-.. GENERATED FROM PYTHON SOURCE LINES 138-165
+.. GENERATED FROM PYTHON SOURCE LINES 144-171
 
 .. code-block:: default
 
@@ -916,12 +432,12 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.339 ms
+    Execution time of this operator: 0.258 ms
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 166-171
+.. GENERATED FROM PYTHON SOURCE LINES 172-177
 
 Using the record file
 ^^^^^^^^^^^^^^^^^^^^^
@@ -929,13 +445,13 @@ During the search, all measurement records are dumped into the record
 file "conv2d.json". The measurement records can be used to re-apply search results,
 resume the search, and perform other analyses.
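
As one hedged sketch of what "resume the search" looks like in practice
(the function name ``resume_search`` and the trial count are illustrative;
the ``task`` object comes from earlier in the tutorial), the measured states
can be preloaded so that tuning picks up where it left off rather than
re-measuring known schedules:

.. code-block:: python

    from tvm import auto_scheduler

    def resume_search(task, log_file):
        # Warm-start the cost model from the existing records.
        cost_model = auto_scheduler.XGBModel()
        cost_model.update_from_file(log_file)

        # Preload already-measured states so they are not measured again.
        search_policy = auto_scheduler.SketchPolicy(
            task,
            cost_model,
            init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
        )

        # Append the new measurements to the same record file.
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=5,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=search_policy)

    resume_search(task, "conv2d.json")
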
 
-.. GENERATED FROM PYTHON SOURCE LINES 173-176
+.. GENERATED FROM PYTHON SOURCE LINES 179-182
 
Here is an example where we load the best schedule from a file and
print the equivalent Python schedule API and CUDA source code.
Both are useful for debugging and for learning the behavior of the auto-scheduler.
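
The code block that follows is collapsed in this diff; in essence it is
(assuming ``task`` and ``log_file`` from earlier in the tutorial):

.. code-block:: python

    # Print the best schedule as equivalent Python schedule API calls,
    # then as the generated CUDA source.
    print("Equivalent python schedule:")
    print(task.print_best(log_file, print_mode="schedule"))

    print("CUDA source code:")
    print(task.print_best(log_file, print_mode="cuda"))
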
 
-.. GENERATED FROM PYTHON SOURCE LINES 176-183
+.. GENERATED FROM PYTHON SOURCE LINES 182-189
 
 .. code-block:: default
 
@@ -965,9 +481,9 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_nn_o_o_o_i, conv2d_nchw_nn_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_i, factor=1)
     conv2d_nchw_nn_o_o_o_o, conv2d_nchw_nn_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_nn_o_o_o_i, factor=1)
     conv2d_nchw_ff_o_i, conv2d_nchw_ff_i = s[conv2d_nchw].split(conv2d_nchw_ff, factor=2)
-    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=2)
+    conv2d_nchw_ff_o_o_i, conv2d_nchw_ff_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_i, factor=4)
     conv2d_nchw_ff_o_o_o_i, conv2d_nchw_ff_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_i, factor=1)
-    conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=2)
+    conv2d_nchw_ff_o_o_o_o, conv2d_nchw_ff_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_ff_o_o_o_i, factor=1)
     conv2d_nchw_yy_o_i, conv2d_nchw_yy_i = s[conv2d_nchw].split(conv2d_nchw_yy, factor=1)
     conv2d_nchw_yy_o_o_i, conv2d_nchw_yy_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_i, factor=1)
     conv2d_nchw_yy_o_o_o_i, conv2d_nchw_yy_o_o_i = s[conv2d_nchw].split(conv2d_nchw_yy_o_o_i, factor=7)
@@ -976,19 +492,19 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     conv2d_nchw_xx_o_o_i, conv2d_nchw_xx_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_i, factor=1)
     conv2d_nchw_xx_o_o_o_i, conv2d_nchw_xx_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_i, factor=7)
     conv2d_nchw_xx_o_o_o_o, conv2d_nchw_xx_o_o_o_i = s[conv2d_nchw].split(conv2d_nchw_xx_o_o_o_i, factor=1)
-    conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=32)
-    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=1)
-    conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=3)
-    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=1)
+    conv2d_nchw_rc_o_i, conv2d_nchw_rc_i = s[conv2d_nchw].split(conv2d_nchw_rc, factor=2)
+    conv2d_nchw_rc_o_o, conv2d_nchw_rc_o_i = s[conv2d_nchw].split(conv2d_nchw_rc_o_i, factor=8)
+    conv2d_nchw_ry_o_i, conv2d_nchw_ry_i = s[conv2d_nchw].split(conv2d_nchw_ry, factor=1)
+    conv2d_nchw_ry_o_o, conv2d_nchw_ry_o_i = s[conv2d_nchw].split(conv2d_nchw_ry_o_i, factor=3)
     conv2d_nchw_rx_o_i, conv2d_nchw_rx_i = s[conv2d_nchw].split(conv2d_nchw_rx, factor=1)
-    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=3)
+    conv2d_nchw_rx_o_o, conv2d_nchw_rx_o_i = s[conv2d_nchw].split(conv2d_nchw_rx_o_i, factor=1)
     s[conv2d_nchw].reorder(conv2d_nchw_nn_o_o_o_o, conv2d_nchw_ff_o_o_o_o, conv2d_nchw_yy_o_o_o_o, conv2d_nchw_xx_o_o_o_o, conv2d_nchw_nn_o_o_o_i, conv2d_nchw_ff_o_o_o_i, conv2d_nchw_yy_o_o_o_i, conv2d_nchw_xx_o_o_o_i, conv2d_nchw_nn_o_o_i, conv2d_nchw_ff_o_o_i, conv2d_nchw_yy_o_o_i, conv2d_nchw_xx_o_o_i, conv2d_nchw_rc_o_o, conv2d_nchw_ry_o_o, conv2d_nchw_rx_o_o, conv2d_nchw_rc_o_i, conv2d_nchw_ry_o_i, conv2d_nchw_rx_o_i, conv2d_nchw_nn_o_i, conv2d_nchw_ff_o_i, conv2d_nchw_yy_o_i, conv2 [...]
     compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
     compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
     compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=4)
+    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=8)
     compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=1)
-    compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=2)
+    compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
     compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
     compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
     compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
@@ -1020,7 +536,7 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=49)
     s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 512)
+    s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "auto_unroll_max_step", 64)
     s[conv2d_nchw].pragma(conv2d_nchw_nn_o_o_o_o, "unroll_explicit", True)
 
     CUDA source code:
@@ -1040,516 +556,107 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
     #endif
     extern "C" __global__ void __launch_bounds__(49) default_function_kernel0(float* __restrict__ data, float* __restrict__ kernel, float* __restrict__ compute, float* __restrict__ bias) {
       float conv2d_nchw[8];
-      __shared__ float pad_temp_shared[2592];
-      __shared__ float kernel_shared[2304];
+      __shared__ float pad_temp_shared[1008];
+      __shared__ float kernel_shared[384];
       conv2d_nchw[0] = 0.000000e+00f;
-      conv2d_nchw[4] = 0.000000e+00f;
       conv2d_nchw[1] = 0.000000e+00f;
-      conv2d_nchw[5] = 0.000000e+00f;
       conv2d_nchw[2] = 0.000000e+00f;
-      conv2d_nchw[6] = 0.000000e+00f;
       conv2d_nchw[3] = 0.000000e+00f;
+      conv2d_nchw[4] = 0.000000e+00f;
+      conv2d_nchw[5] = 0.000000e+00f;
+      conv2d_nchw[6] = 0.000000e+00f;
       conv2d_nchw[7] = 0.000000e+00f;
-      for (int rc_outer_outer = 0; rc_outer_outer < 16; ++rc_outer_outer) {
-        __syncthreads();
-        pad_temp_shared[((int)threadIdx.x)] = ((((9 <= ((int)threadIdx.x)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((rc_outer_outer * 1568) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 49)] = (((((9 <= ((((int)threadIdx.x) + 49) % 81)) && (((((int)threadIdx.x) + 49) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 49) / 81) * 49)) + ((((((int)threadIdx.x) + 49) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 98)] = (((1 <= ((((int)threadIdx.x) + 8) % 9)) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 98) / 81) * 49)) + (((((int)threadIdx.x) + 17) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 147)] = (((((9 <= ((((int)threadIdx.x) + 66) % 81)) && (((((int)threadIdx.x) + 66) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 147) / 81) * 49)) + ((((((int)threadIdx.x) + 66) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 196)] = (((((9 <= ((((int)threadIdx.x) + 34) % 81)) && (((((int)threadIdx.x) + 34) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 196) / 81) * 49)) + ((((((int)threadIdx.x) + 34) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 245)] = ((((7 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 245) / 81) * 49)) + (((((int)threadIdx.x) + 2) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 294)] = (((((9 <= ((((int)threadIdx.x) + 51) % 81)) && (((((int)threadIdx.x) + 51) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 294) / 81) * 49)) + ((((((int)threadIdx.x) + 51) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 343)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 343) / 81) * 49)) + (((((int)threadIdx.x) + 19) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 392)] = (((((9 <= ((((int)threadIdx.x) + 68) % 81)) && (((((int)threadIdx.x) + 68) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 392) / 81) * 49)) + ((((((int)threadIdx.x) + 68) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 441)] = (((((1 <= (((((int)threadIdx.x) / 9) + 4) % 9)) && (((((int)threadIdx.x) + 36) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 441) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 4) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 490)] = ((((5 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 490) / 81) * 49)) + (((((int)threadIdx.x) + 4) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 539)] = (((((9 <= ((((int)threadIdx.x) + 53) % 81)) && (((((int)threadIdx.x) + 53) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 539) / 81) * 49)) + ((((((int)threadIdx.x) + 53) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 588)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 588) / 81) * 49)) + (((((int)threadIdx.x) + 21) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 637)] = (((((9 <= ((((int)threadIdx.x) + 70) % 81)) && (((((int)threadIdx.x) + 70) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 637) / 81) * 49)) + ((((((int)threadIdx.x) + 70) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 686)] = (((((9 <= ((((int)threadIdx.x) + 38) % 81)) && (((((int)threadIdx.x) + 38) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 686) / 81) * 49)) + ((((((int)threadIdx.x) + 38) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 735)] = ((((3 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 735) / 81) * 49)) + (((((int)threadIdx.x) + 6) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((9 <= ((((int)threadIdx.x) + 55) % 81)) && (((((int)threadIdx.x) + 55) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 784) / 81) * 49)) + ((((((int)threadIdx.x) + 55) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 833)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 833) / 81) * 49)) + (((((int)threadIdx.x) + 23) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 882)] = (((((1 <= (((((int)threadIdx.x) / 9) + 8) % 9)) && (((((int)threadIdx.x) + 72) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 882) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 8) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 931)] = (((((9 <= ((((int)threadIdx.x) + 40) % 81)) && (((((int)threadIdx.x) + 40) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 931) / 81) * 49)) + ((((((int)threadIdx.x) + 40) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 980)] = ((((1 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 980) / 81) * 49)) + (((((int)threadIdx.x) + 8) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1029)] = (((((9 <= ((((int)threadIdx.x) + 57) % 81)) && (((((int)threadIdx.x) + 57) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1029) / 81) * 49)) + ((((((int)threadIdx.x) + 57) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1078)] = ((((((int)threadIdx.x) < 47) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1078) / 81) * 49)) + (((((int)threadIdx.x) + 25) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1127)] = (((((9 <= ((((int)threadIdx.x) + 74) % 81)) && (((((int)threadIdx.x) + 74) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1127) / 81) * 49)) + ((((((int)threadIdx.x) + 74) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1176)] = (((((9 <= ((((int)threadIdx.x) + 42) % 81)) && (((((int)threadIdx.x) + 42) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1176) / 81) * 49)) + ((((((int)threadIdx.x) + 42) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1225)] = (((1 <= ((((int)threadIdx.x) + 1) % 9)) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1225) / 81) * 49)) + (((((int)threadIdx.x) + 10) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1274)] = (((((9 <= ((((int)threadIdx.x) + 59) % 81)) && (((((int)threadIdx.x) + 59) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1274) / 81) * 49)) + ((((((int)threadIdx.x) + 59) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1323)] = ((((((int)threadIdx.x) < 45) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1323) / 81) * 49)) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) + 13)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1372)] = (((((9 <= ((((int)threadIdx.x) + 76) % 81)) && (((((int)threadIdx.x) + 76) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1372) / 81) * 49)) + ((((((int)threadIdx.x) + 76) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1421)] = (((((9 <= ((((int)threadIdx.x) + 44) % 81)) && (((((int)threadIdx.x) + 44) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1421) / 81) * 49)) + ((((((int)threadIdx.x) + 44) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1470)] = (((1 <= ((((int)threadIdx.x) + 3) % 9)) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1470) / 81) * 49)) + (((((int)threadIdx.x) + 12) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1519)] = (((((9 <= ((((int)threadIdx.x) + 61) % 81)) && (((((int)threadIdx.x) + 61) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1519) / 81) * 49)) + ((((((int)threadIdx.x) + 61) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1568)] = ((((((int)threadIdx.x) < 43) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1568) / 81) * 49)) + (((((int)threadIdx.x) + 29) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1617)] = (((((9 <= ((((int)threadIdx.x) + 78) % 81)) && (((((int)threadIdx.x) + 78) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1617) / 81) * 49)) + ((((((int)threadIdx.x) + 78) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1666)] = (((((9 <= ((((int)threadIdx.x) + 46) % 81)) && (((((int)threadIdx.x) + 46) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1666) / 81) * 49)) + ((((((int)threadIdx.x) + 46) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1715)] = (((1 <= ((((int)threadIdx.x) + 5) % 9)) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1715) / 81) * 49)) + (((((int)threadIdx.x) + 14) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1764)] = (((((1 <= (((((int)threadIdx.x) / 9) + 7) % 9)) && (((((int)threadIdx.x) + 63) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1764) / 81) * 49)) + ((((((int)threadIdx.x) / 9) + 7) % 9) * 7)) + (((int)threadIdx.x) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1813)] = ((((((int)threadIdx.x) < 41) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1813) / 81) * 49)) + (((((int)threadIdx.x) + 31) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1862)] = (((((9 <= ((((int)threadIdx.x) + 80) % 81)) && (((((int)threadIdx.x) + 80) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1862) / 81) * 49)) + ((((((int)threadIdx.x) + 80) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1911)] = (((((9 <= ((((int)threadIdx.x) + 48) % 81)) && (((((int)threadIdx.x) + 48) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1911) / 81) * 49)) + ((((((int)threadIdx.x) + 48) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 1960)] = (((1 <= ((((int)threadIdx.x) + 7) % 9)) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 1960) / 81) * 49)) + (((((int)threadIdx.x) + 16) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2009)] = (((((9 <= ((((int)threadIdx.x) + 65) % 81)) && (((((int)threadIdx.x) + 65) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2009) / 81) * 49)) + ((((((int)threadIdx.x) + 65) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2058)] = (((((9 <= ((((int)threadIdx.x) + 33) % 81)) && (((((int)threadIdx.x) + 33) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2058) / 81) * 49)) + ((((((int)threadIdx.x) + 33) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2107)] = ((((8 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2107) / 81) * 49)) + (((((int)threadIdx.x) + 1) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2156)] = (((((9 <= ((((int)threadIdx.x) + 50) % 81)) && (((((int)threadIdx.x) + 50) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 5) % 9))) && (((((int)threadIdx.x) + 5) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2156) / 81) * 49)) + ((((((int)threadIdx.x) + 50) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 5) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2205)] = (((1 <= (((int)threadIdx.x) % 9)) && ((((int)threadIdx.x) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2205) / 81) * 49)) + ((((int)threadIdx.x) / 9) * 7)) + (((int)threadIdx.x) % 9)) + 6)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2254)] = (((((9 <= ((((int)threadIdx.x) + 67) % 81)) && (((((int)threadIdx.x) + 67) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2254) / 81) * 49)) + ((((((int)threadIdx.x) + 67) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2303)] = (((((9 <= ((((int)threadIdx.x) + 35) % 81)) && (((((int)threadIdx.x) + 35) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2303) / 81) * 49)) + ((((((int)threadIdx.x) + 35) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2352)] = ((((6 <= ((int)threadIdx.x)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2352) / 81) * 49)) + (((((int)threadIdx.x) + 3) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2401)] = (((((9 <= ((((int)threadIdx.x) + 52) % 81)) && (((((int)threadIdx.x) + 52) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2401) / 81) * 49)) + ((((((int)threadIdx.x) + 52) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2450)] = (((1 <= ((((int)threadIdx.x) + 2) % 9)) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2450) / 81) * 49)) + (((((int)threadIdx.x) + 20) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8)] : 0.000000e+00f);
-        pad_temp_shared[(((int)threadIdx.x) + 2499)] = (((((9 <= ((((int)threadIdx.x) + 69) % 81)) && (((((int)threadIdx.x) + 69) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 6) % 9))) && (((((int)threadIdx.x) + 6) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2499) / 81) * 49)) + ((((((int)threadIdx.x) + 69) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 6) % 9)) - 8)] : 0.000000e+00f);
-        if (((int)threadIdx.x) < 44) {
-          pad_temp_shared[(((int)threadIdx.x) + 2548)] = ((((((int)threadIdx.x) < 35) && (1 <= ((((int)threadIdx.x) + 1) % 9))) && (((((int)threadIdx.x) + 1) % 9) < 8)) ? data[(((((rc_outer_outer * 1568) + (((((int)threadIdx.x) + 2548) / 81) * 49)) + (((((int)threadIdx.x) + 37) / 9) * 7)) + ((((int)threadIdx.x) + 1) % 9)) - 8)] : 0.000000e+00f);
-        }
-        kernel_shared[((int)threadIdx.x)] = kernel[(((((int)blockIdx.x) * 36864) + (rc_outer_outer * 288)) + ((int)threadIdx.x))];
-        kernel_shared[(((int)threadIdx.x) + 49)] = kernel[((((((int)blockIdx.x) * 36864) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 49) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 98)] = kernel[((((((int)blockIdx.x) * 36864) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 98) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 147)] = kernel[((((((int)blockIdx.x) * 36864) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 147)];
-        kernel_shared[(((int)threadIdx.x) + 196)] = kernel[((((((int)blockIdx.x) * 36864) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 196) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 245)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 245) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 245) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 294)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 294) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 6)];
-        kernel_shared[(((int)threadIdx.x) + 343)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 343) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 55) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 392)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 392) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 104) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 441)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 441) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 153)];
-        kernel_shared[(((int)threadIdx.x) + 490)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 490) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 202) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 539)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 539) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 251) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 588)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 588) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 12)];
-        kernel_shared[(((int)threadIdx.x) + 637)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 637) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 61) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 686)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 686) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 110) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 735)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 735) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 159)];
-        kernel_shared[(((int)threadIdx.x) + 784)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 784) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 208) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 833)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 833) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 257) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 882)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 882) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 18)];
-        kernel_shared[(((int)threadIdx.x) + 931)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 931) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 67) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 980)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 980) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 116) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1029)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1029) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 165)];
-        kernel_shared[(((int)threadIdx.x) + 1078)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1078) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 214) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1127)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1127) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 263) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1176)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1176) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 24)];
-        kernel_shared[(((int)threadIdx.x) + 1225)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1225) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 73) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1274)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1274) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 122) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1323)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1323) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 171)];
-        kernel_shared[(((int)threadIdx.x) + 1372)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1372) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 220) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1421)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1421) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 269) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1470)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1470) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 30)];
-        kernel_shared[(((int)threadIdx.x) + 1519)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1519) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 79) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1568)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1568) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 128) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1617)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1617) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 177)];
-        kernel_shared[(((int)threadIdx.x) + 1666)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1666) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 226) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1715)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1715) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 275) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1764)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1764) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 36)];
-        kernel_shared[(((int)threadIdx.x) + 1813)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1813) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 85) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1862)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1862) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 134) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 1911)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1911) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 183)];
-        kernel_shared[(((int)threadIdx.x) + 1960)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 1960) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 232) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 2009)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2009) / 288) * 4608)) + (rc_outer_outer * 288)) + ((((((int)threadIdx.x) + 281) % 288) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 2058)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2058) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 42)];
-        kernel_shared[(((int)threadIdx.x) + 2107)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2107) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 91) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 2156)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2156) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 140) / 3) * 3)) + ((((int)threadIdx.x) + 2) % 3))];
-        kernel_shared[(((int)threadIdx.x) + 2205)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2205) / 288) * 4608)) + (rc_outer_outer * 288)) + ((int)threadIdx.x)) + 189)];
-        kernel_shared[(((int)threadIdx.x) + 2254)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2254) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 238) / 3) * 3)) + ((((int)threadIdx.x) + 1) % 3))];
-        if (((int)threadIdx.x) < 1) {
-          kernel_shared[(((int)threadIdx.x) + 2303)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 2303) / 288) * 4608)) + (rc_outer_outer * 288)) + (((((int)threadIdx.x) + 287) / 3) * 3)) + ((int)threadIdx.x)) + 2)];
-        }
-        __syncthreads();
-        for (int rx_outer_inner = 0; rx_outer_inner < 3; ++rx_outer_inner) {
-          for (int ff_outer_inner = 0; ff_outer_inner < 2; ++ff_outer_inner) {
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7))] * kernel_shared[((ff_outer_inner * 576) + rx_outer_inner)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7))] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1152)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7))] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 288)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7))] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1440)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 3)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1155)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 291)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 9)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1443)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 6)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1158)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 294)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 18)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1446)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 9)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1161)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 297)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 81)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1449)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 12)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1164)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 300)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 90)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1452)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 15)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1167)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 303)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 99)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1455)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 18)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1170)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 306)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 162)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1458)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 21)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1173)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 309)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 171)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1461)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 24)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1176)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 312)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 180)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1464)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 27)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1179)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 315)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 243)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1467)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 30)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1182)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 318)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 252)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1470)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 33)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1185)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 321)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 261)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1473)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 36)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1188)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 324)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 324)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1476)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 39)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1191)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 327)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 333)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1479)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 42)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1194)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 330)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 342)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1482)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 45)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1197)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 333)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 405)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1485)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 48)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1200)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 336)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 414)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1488)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 51)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1203)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 339)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 423)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1491)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 54)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1206)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 342)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 486)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1494)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 57)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1209)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 345)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 495)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1497)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 60)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1212)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 348)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 504)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1500)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 63)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1215)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 351)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 567)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1503)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 66)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1218)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 354)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 576)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1506)]));
-            // Remaining 3x3 reduction taps over input channels rc = 7..27
-            // (tap index rcry = rc*3 + ry runs 23..81 here). Each tap reads one
-            // element of the 9x9 padded input tile for channel rc from
-            // pad_temp_shared (81 floats per channel) and accumulates it into
-            // four output-channel registers, whose filter slabs lie 288 floats
-            // (32 channels x 9 taps) apart in kernel_shared.
-            for (int rcry = 23; rcry < 82; ++rcry) {
-              int rc_i = rcry / 3;
-              int ry_i = rcry % 3;
-              float pad_v = pad_temp_shared[((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + ((rc_i * 81) + (ry_i * 9))];
-              int kbase = ((ff_outer_inner * 576) + rx_outer_inner) + ((rc_i * 9) + (ry_i * 3));
-              conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_v * kernel_shared[kbase]));
-              conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_v * kernel_shared[(kbase + 1152)]));
-              conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_v * kernel_shared[(kbase + 288)]));
-              conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_v * kernel_shared[(kbase + 1440)]));
-            }
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2196)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 246)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2196)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1398)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2196)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 534)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2196)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1686)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2205)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 249)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2205)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1401)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2205)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 537)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2205)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1689)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2268)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 252)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2268)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1404)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2268)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 540)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2268)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1692)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2277)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 255)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2277)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1407)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2277)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 543)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2277)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1695)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2286)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 258)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2286)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1410)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2286)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 546)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2286)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1698)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2349)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 261)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2349)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1413)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2349)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 549)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2349)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1701)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2358)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 264)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2358)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1416)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2358)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 552)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2358)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1704)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2367)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 267)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2367)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1419)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2367)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 555)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2367)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1707)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2430)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 270)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2430)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1422)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2430)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 558)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2430)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1710)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2439)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 273)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2439)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1425)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2439)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 561)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2439)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1713)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2448)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 276)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2448)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1428)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2448)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 564)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2448)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1716)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2511)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 279)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2511)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1431)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2511)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 567)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2511)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1719)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2520)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 282)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2520)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1434)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2520)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 570)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2520)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1722)]));
-            conv2d_nchw[(ff_outer_inner * 2)] = (conv2d_nchw[(ff_outer_inner * 2)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2529)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 285)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 4)] = (conv2d_nchw[((ff_outer_inner * 2) + 4)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2529)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1437)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 1)] = (conv2d_nchw[((ff_outer_inner * 2) + 1)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2529)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 573)]));
-            conv2d_nchw[((ff_outer_inner * 2) + 5)] = (conv2d_nchw[((ff_outer_inner * 2) + 5)] + (pad_temp_shared[(((((((int)threadIdx.x) / 7) * 9) + rx_outer_inner) + (((int)threadIdx.x) % 7)) + 2529)] * kernel_shared[(((ff_outer_inner * 576) + rx_outer_inner) + 1725)]));
+      for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
+        for (int rx_outer_outer = 0; rx_outer_outer < 3; ++rx_outer_outer) {
+          __syncthreads();
+          pad_temp_shared[((int)threadIdx.x)] = ((((7 <= ((int)threadIdx.x)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((rc_outer_outer * 784) + ((int)threadIdx.x)) + rx_outer_outer) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 49)] = (((((1 <= (((((int)threadIdx.x) / 7) + 7) % 9)) && ((((((int)threadIdx.x) / 7) + 7) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 49) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 7) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 98)] = (((((1 <= (((((int)threadIdx.x) / 7) + 5) % 9)) && ((((((int)threadIdx.x) / 7) + 5) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 98) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 5) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 147)] = (((((1 <= (((((int)threadIdx.x) / 7) + 3) % 9)) && ((((((int)threadIdx.x) / 7) + 3) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 147) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 3) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 196)] = (((1 <= (rx_outer_outer + (((int)threadIdx.x) % 7))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 196) / 63) * 49)) + ((int)threadIdx.x)) + rx_outer_outer) - 1)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 245)] = (((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 245) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 294)] = (((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 294) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 343)] = (((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 343) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 392)] = ((((((int)threadIdx.x) < 42) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 392) / 63) * 49)) + ((int)threadIdx.x)) + rx_outer_outer) + 6)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 441)] = ((((7 <= ((int)threadIdx.x)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((rc_outer_outer * 784) + ((int)threadIdx.x)) + rx_outer_outer) + 335)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 490)] = (((((1 <= (((((int)threadIdx.x) / 7) + 7) % 9)) && ((((((int)threadIdx.x) / 7) + 7) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 490) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 7) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 539)] = (((((1 <= (((((int)threadIdx.x) / 7) + 5) % 9)) && ((((((int)threadIdx.x) / 7) + 5) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 539) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 5) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 588)] = (((((1 <= (((((int)threadIdx.x) / 7) + 3) % 9)) && ((((((int)threadIdx.x) / 7) + 3) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 588) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 3) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 637)] = (((1 <= (rx_outer_outer + (((int)threadIdx.x) % 7))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 637) / 63) * 49)) + ((int)threadIdx.x)) + rx_outer_outer) - 1)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 686)] = (((((1 <= (((((int)threadIdx.x) / 7) + 8) % 9)) && ((((((int)threadIdx.x) / 7) + 8) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 686) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 8) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 735)] = (((((1 <= (((((int)threadIdx.x) / 7) + 6) % 9)) && ((((((int)threadIdx.x) / 7) + 6) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 735) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 6) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 784)] = (((((1 <= (((((int)threadIdx.x) / 7) + 4) % 9)) && ((((((int)threadIdx.x) / 7) + 4) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 784) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 4) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 833)] = ((((((int)threadIdx.x) < 42) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 833) / 63) * 49)) + ((int)threadIdx.x)) + rx_outer_outer) + 6)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 882)] = ((((7 <= ((int)threadIdx.x)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((rc_outer_outer * 784) + ((int)threadIdx.x)) + rx_outer_outer) + 678)] : 0.000000e+00f);
+          pad_temp_shared[(((int)threadIdx.x) + 931)] = (((((1 <= (((((int)threadIdx.x) / 7) + 7) % 9)) && ((((((int)threadIdx.x) / 7) + 7) % 9) < 8)) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 931) / 63) * 49)) + ((((((int)threadIdx.x) / 7) + 7) % 9) * 7)) + rx_outer_outer) + (((int)threadIdx.x) % 7)) - 8)] : 0.000000e+00f);
+          if (((int)threadIdx.x) < 28) {
+            pad_temp_shared[(((int)threadIdx.x) + 980)] = ((((((int)threadIdx.x) < 21) && (1 <= (rx_outer_outer + (((int)threadIdx.x) % 7)))) && ((rx_outer_outer + (((int)threadIdx.x) % 7)) < 8)) ? data[(((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 980) / 63) * 49)) + ((int)threadIdx.x)) + rx_outer_outer) + 27)] : 0.000000e+00f);
+          }
+          kernel_shared[((int)threadIdx.x)] = kernel[(((((((int)blockIdx.x) * 36864) + ((((int)threadIdx.x) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((int)threadIdx.x) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 49)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 49) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 1) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 98)] = kernel[(((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 98) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 2) % 48) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 147)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 147) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 1) & 15) * 9)) + ((((int)threadIdx.x) % 3) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 196)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 196) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 4) % 48) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 245)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 245) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) + 5) % 48) / 3) * 9)) + (((((int)threadIdx.x) + 2) % 3) * 3)) + rx_outer_outer)];
+          kernel_shared[(((int)threadIdx.x) + 294)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 294) / 48) * 4608)) + (rc_outer_outer * 144)) + ((((((int)threadIdx.x) / 3) + 2) & 15) * 9)) + ((((int)threadIdx.x) % 3) * 3)) + rx_outer_outer)];
+          if (((int)threadIdx.x) < 41) {
+            kernel_shared[(((int)threadIdx.x) + 343)] = kernel[((((((((int)blockIdx.x) * 36864) + (((((int)threadIdx.x) + 343) / 48) * 4608)) + (rc_outer_outer * 144)) + (((((int)threadIdx.x) + 7) / 3) * 9)) + (((((int)threadIdx.x) + 1) % 3) * 3)) + rx_outer_outer)];
+          }
+          __syncthreads();
+          for (int rc_outer_inner = 0; rc_outer_inner < 8; ++rc_outer_inner) {
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[(rc_outer_inner * 6)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 48)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 3)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 51)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 96)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 144)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 99)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 147)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 192)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 240)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 195)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 243)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 288)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[((rc_outer_inner * 126) + ((int)threadIdx.x))] * kernel_shared[((rc_outer_inner * 6) + 336)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 291)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 63)] * kernel_shared[((rc_outer_inner * 6) + 339)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 1)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 49)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 4)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 52)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 97)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 145)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 100)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 148)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 193)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 241)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 196)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 244)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 289)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 7)] * kernel_shared[((rc_outer_inner * 6) + 337)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 292)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 70)] * kernel_shared[((rc_outer_inner * 6) + 340)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 2)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 50)]));
+            conv2d_nchw[0] = (conv2d_nchw[0] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 5)]));
+            conv2d_nchw[1] = (conv2d_nchw[1] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 53)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 98)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 146)]));
+            conv2d_nchw[2] = (conv2d_nchw[2] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 101)]));
+            conv2d_nchw[3] = (conv2d_nchw[3] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 149)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 194)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 242)]));
+            conv2d_nchw[4] = (conv2d_nchw[4] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 197)]));
+            conv2d_nchw[5] = (conv2d_nchw[5] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 245)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 290)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 14)] * kernel_shared[((rc_outer_inner * 6) + 338)]));
+            conv2d_nchw[6] = (conv2d_nchw[6] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 293)]));
+            conv2d_nchw[7] = (conv2d_nchw[7] + (pad_temp_shared[(((rc_outer_inner * 126) + ((int)threadIdx.x)) + 77)] * kernel_shared[((rc_outer_inner * 6) + 341)]));
           }
         }
       }
-      for (int i1_inner = 0; i1_inner < 4; ++i1_inner) {
+      for (int i1_inner = 0; i1_inner < 8; ++i1_inner) {
         compute[(((((int)blockIdx.x) * 392) + (i1_inner * 49)) + ((int)threadIdx.x))] = max((conv2d_nchw[i1_inner] + bias[((((int)blockIdx.x) * 8) + i1_inner)]), 0.000000e+00f);
-        compute[((((((int)blockIdx.x) * 392) + (i1_inner * 49)) + ((int)threadIdx.x)) + 196)] = max((conv2d_nchw[(i1_inner + 4)] + bias[(((((int)blockIdx.x) * 8) + i1_inner) + 4)]), 0.000000e+00f);
       }
     }
 
@@ -1558,14 +665,14 @@ They can be used for debugging and learning the behavior of the auto-scheduler.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 184-188
+.. GENERATED FROM PYTHON SOURCE LINES 190-194
 
 A more complicated example is to resume the search.
 In this case, we need to create the search policy and cost model ourselves
 and restore the status of the search policy and cost model from the log file.
 In the example below we resume the status and run 5 more trials.
 
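 A minimal sketch of that resume step, using the standard
 :code:`auto_scheduler` APIs (:code:`task` and :code:`log_file` are assumed
 to exist from the earlier steps of the tutorial):

 .. code-block:: default

     # Sketch only: warm-start the cost model and search policy from the log.
     cost_model = auto_scheduler.XGBModel()
     cost_model.update_from_file(log_file)
     search_policy = auto_scheduler.SketchPolicy(
         task,
         cost_model,
         init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
     )
     tune_option = auto_scheduler.TuningOptions(
         num_measure_trials=5,  # the 5 extra trials
         measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
     )
     task.tune(tune_option, search_policy=search_policy)
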
-.. GENERATED FROM PYTHON SOURCE LINES 188-210
+.. GENERATED FROM PYTHON SOURCE LINES 194-216
 
 .. code-block:: default
 
@@ -1611,7 +718,7 @@ In the example below we resume the status and do more 5 trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  42.181 seconds)
+   **Total running time of the script:** ( 2 minutes  46.115 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_conv2d_layer_cuda.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt
index e929ed84d..4e85b9ecb 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_arm.rst.txt
@@ -46,11 +46,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 48-59
+.. GENERATED FROM PYTHON SOURCE LINES 48-60
 
 .. code-block:: default
 
 
+
     import numpy as np
     import os
 
@@ -68,7 +69,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 60-72
+.. GENERATED FROM PYTHON SOURCE LINES 66-78
 
 Define a Network
 ----------------
@@ -83,7 +84,7 @@ We also implemented more optimizations for NHWC layout with the auto-scheduler.
 So it is recommended to convert your models to NHWC layout to use the auto-scheduler.
 You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
 
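 As a side note for readers, the layout conversion itself is a short Relay
 pass pipeline; a sketch, assuming a Relay module :code:`mod`:

 .. code-block:: default

     # Sketch: convert nn.conv2d ops of a Relay module `mod` to NHWC layout.
     import tvm
     from tvm import relay

     desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
     seq = tvm.transform.Sequential(
         [relay.transform.ConvertLayout(desired_layouts)]
     )
     with tvm.transform.PassContext(opt_level=3):
         mod = seq(mod)
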
-.. GENERATED FROM PYTHON SOURCE LINES 72-149
+.. GENERATED FROM PYTHON SOURCE LINES 78-155
 
 .. code-block:: default
 
@@ -171,7 +172,7 @@ You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout co
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 150-174
+.. GENERATED FROM PYTHON SOURCE LINES 156-180
 
 Start RPC Tracker
 -----------------
@@ -198,7 +199,7 @@ The expected output is
 
   INFO:RPCTracker:bind to 0.0.0.0:9190
 
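 For reference, this output comes from starting the tracker on the host with
 the standard TVM command:

 .. code-block:: default

     python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
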
-.. GENERATED FROM PYTHON SOURCE LINES 176-218
+.. GENERATED FROM PYTHON SOURCE LINES 182-224
 
 Register Devices to RPC Tracker
 -----------------------------------
@@ -243,7 +244,7 @@ the output can be
 
 You can register multiple devices to the tracker to accelerate the measurement in tuning.
 
-.. GENERATED FROM PYTHON SOURCE LINES 220-226
+.. GENERATED FROM PYTHON SOURCE LINES 226-232
 
 Set Tuning Options
 ------------------
@@ -252,7 +253,7 @@ as example with a 64bit OS (Ubuntu 20.04). In your setting, you should modify th
 and device_key accordingly.
 Set :code:`use_ndk` to True if you use an Android phone.
 
-.. GENERATED FROM PYTHON SOURCE LINES 226-253
+.. GENERATED FROM PYTHON SOURCE LINES 232-259
 
 .. code-block:: default
 
@@ -290,7 +291,7 @@ set :code:`use_ndk` to True if you use android phone.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 254-263
+.. GENERATED FROM PYTHON SOURCE LINES 260-269
 
 Extract Search Tasks
 --------------------
@@ -302,7 +303,7 @@ as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
 latency of a task and :code:`weight[t]` is the weight of the task.
 The task scheduler will just optimize this objective.
 
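 As a concrete illustration with made-up numbers: for two tasks with
 :code:`latency = [1.2, 0.8]` (ms) and :code:`weight = [3, 1]`, the scheduler
 minimizes the weighted sum:

 .. code-block:: default

     # Made-up numbers, for intuition only.
     latency = [1.2, 0.8]  # best measured latency of each task (ms)
     weight = [3, 1]       # how often each task appears in the network
     objective = sum(l * w for l, w in zip(latency, weight))  # 4.4 ms
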
-.. GENERATED FROM PYTHON SOURCE LINES 263-277
+.. GENERATED FROM PYTHON SOURCE LINES 269-283
 
 .. code-block:: default
 
@@ -528,7 +529,7 @@ The task scheduler will just optimize this objective.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 278-297
+.. GENERATED FROM PYTHON SOURCE LINES 284-303
 
 Tuning and Evaluation
 ---------------------
@@ -550,7 +551,7 @@ After auto-tuning, we can compile the network with the best schedules we found.
 All measurement records are dumped into the log file during auto-tuning,
 so we can read the log file and load the best schedules.
 
-.. GENERATED FROM PYTHON SOURCE LINES 297-362
+.. GENERATED FROM PYTHON SOURCE LINES 303-368
 
 .. code-block:: default
 
@@ -626,7 +627,7 @@ so we can read the log file and load the best schedules.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 363-414
+.. GENERATED FROM PYTHON SOURCE LINES 369-420
 
 .. note:: Explaining the printed information during tuning
 
@@ -680,7 +681,7 @@ so we can read the log file and load the best schedules.
   errors are isolated from the main process.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 416-422
+.. GENERATED FROM PYTHON SOURCE LINES 422-428
 
 .. note:: Terminate the tuning earlier
 
@@ -689,7 +690,7 @@ so we can read the log file and load the best schedules.
   you should be able to do the compilation (the section below).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 424-440
+.. GENERATED FROM PYTHON SOURCE LINES 430-446
 
 Other Tips
 ----------
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
index b54d6f802..05b1770bb 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_cuda.rst.txt
@@ -46,11 +46,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 46-54
+.. GENERATED FROM PYTHON SOURCE LINES 46-55
 
 .. code-block:: default
 
 
+
     import numpy as np
 
     import tvm
@@ -65,7 +66,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 55-67
+.. GENERATED FROM PYTHON SOURCE LINES 61-73
 
 Define a Network
 ----------------
@@ -80,7 +81,7 @@ We also implemented more optimizations for NHWC layout with the auto-scheduler.
 So it is recommended to convert your models to NHWC layout to use the auto-scheduler.
 You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 67-141
+.. GENERATED FROM PYTHON SOURCE LINES 73-147
 
 .. code-block:: default
 
@@ -165,7 +166,7 @@ You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout co
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 142-151
+.. GENERATED FROM PYTHON SOURCE LINES 148-157
 
 Extract Search Tasks
 --------------------
@@ -177,7 +178,7 @@ as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
 latency of a task and :code:`weight[t]` is the weight of the task.
 The task scheduler will just optimize this objective.
 
-.. GENERATED FROM PYTHON SOURCE LINES 151-161
+.. GENERATED FROM PYTHON SOURCE LINES 157-167
 
 .. code-block:: default
 
@@ -479,7 +480,7 @@ The task scheduler will just optimize this objective.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 162-184
+.. GENERATED FROM PYTHON SOURCE LINES 168-190
 
 Begin Tuning
 ------------
@@ -504,7 +505,7 @@ Now, we set some options for tuning and launch the search tasks
   :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 184-206
+.. GENERATED FROM PYTHON SOURCE LINES 190-212
 
 .. code-block:: default
 
@@ -537,7 +538,7 @@ Now, we set some options for tuning and launch the search tasks
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 207-260
+.. GENERATED FROM PYTHON SOURCE LINES 213-266
 
 .. note:: Explain the printed information during tuning
 
@@ -593,7 +594,7 @@ Now, we set some options for tuning and launch the search tasks
   errors are isolated from the main process.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 262-268
+.. GENERATED FROM PYTHON SOURCE LINES 268-274
 
 .. note:: Terminate the tuning earlier
 
@@ -602,7 +603,7 @@ Now, we set some options for tuning and launch the search tasks
   you should be able to do the compilation (the section below).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 271-276
+.. GENERATED FROM PYTHON SOURCE LINES 277-282
 
 Compile and Evaluate
 --------------------
@@ -610,7 +611,7 @@ After auto-tuning, we can compile the network with the best schedules we found.
 All measurement records are dumped into the log file during auto-tuning,
 so we can read the log file and load the best schedules.
 
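 A minimal sketch of that compile step, assuming :code:`mod`, :code:`params`,
 :code:`target`, and :code:`log_file` from earlier in the tutorial:

 .. code-block:: default

     # Sketch: build the network with the best schedules found during tuning.
     with auto_scheduler.ApplyHistoryBest(log_file):
         with tvm.transform.PassContext(
             opt_level=3, config={"relay.backend.use_auto_scheduler": True}
         ):
             lib = relay.build(mod, target=target, params=params)
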
-.. GENERATED FROM PYTHON SOURCE LINES 276-294
+.. GENERATED FROM PYTHON SOURCE LINES 282-300
 
 .. code-block:: default
 
@@ -646,13 +647,13 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-       9.8859       9.9108       9.9451       9.8016       0.0612   
+       9.9443       9.9574       9.9875       9.8880       0.0417   
                
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 295-311
+.. GENERATED FROM PYTHON SOURCE LINES 301-317
 
 Other Tips
 ----------
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt
index b08c35d19..68360225f 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_mali.rst.txt
@@ -46,11 +46,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 46-55
+.. GENERATED FROM PYTHON SOURCE LINES 46-56
 
 .. code-block:: default
 
 
+
     import numpy as np
 
     import tvm
@@ -66,7 +67,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 56-68
+.. GENERATED FROM PYTHON SOURCE LINES 62-74
 
 Define a Network
 ----------------
@@ -81,7 +82,7 @@ We also implemented more optimizations for NHWC layout with the auto-scheduler.
 So it is recommended to convert your models to NHWC layout to use the auto-scheduler.
 You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 68-147
+.. GENERATED FROM PYTHON SOURCE LINES 74-153
 
 .. code-block:: default
 
@@ -171,7 +172,7 @@ You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout co
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 148-153
+.. GENERATED FROM PYTHON SOURCE LINES 154-159
 
 Start an RPC Tracker and Register Devices to the Tracker
 --------------------------------------------------------
@@ -179,7 +180,7 @@ Please refer to the "Start RPC Tracker" and "Register Devices to RPC Tracker" se
 in this :ref:`tutorial <tutorials-autotvm-start-rpc-tracker>` to start an RPC tracker
 and register devices to the tracker.
 
-.. GENERATED FROM PYTHON SOURCE LINES 153-158
+.. GENERATED FROM PYTHON SOURCE LINES 159-164
 
 .. code-block:: default
 
@@ -195,7 +196,7 @@ and register devices to the tracker.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-168
+.. GENERATED FROM PYTHON SOURCE LINES 165-174
 
 Extract Search Tasks
 --------------------
@@ -207,7 +208,7 @@ as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
 latency of a task and :code:`weight[t]` is the weight of the task.
 The task scheduler will just optimize this objective.
 
-.. GENERATED FROM PYTHON SOURCE LINES 168-177
+.. GENERATED FROM PYTHON SOURCE LINES 174-183
 
 .. code-block:: default
 
@@ -427,7 +428,7 @@ The task scheduler will just optimize this objective.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 178-204
+.. GENERATED FROM PYTHON SOURCE LINES 184-210
 
 .. note:: How to get the hardware parameters from remote device
 
@@ -456,7 +457,7 @@ The task scheduler will just optimize this objective.
    )
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 206-222
+.. GENERATED FROM PYTHON SOURCE LINES 212-228
 
 Tuning and Evaluation
 ---------------------
@@ -475,7 +476,7 @@ Now, we set some options for tuning, launch the search tasks and evaluate the en
   :any:`auto_scheduler.LocalRunner` for more parameters.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 222-275
+.. GENERATED FROM PYTHON SOURCE LINES 228-281
 
 .. code-block:: default
 
@@ -539,7 +540,7 @@ Now, we set some options for tuning, launch the search tasks and evaluate the en
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 276-334
+.. GENERATED FROM PYTHON SOURCE LINES 282-340
 
 .. note:: Explain the printed information during tuning
 
@@ -600,7 +601,7 @@ Now, we set some options for tuning, launch the search tasks and evaluate the en
   errors are isolated from the main process.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 336-342
+.. GENERATED FROM PYTHON SOURCE LINES 342-348
 
 .. note:: Terminate the tuning earlier
 
@@ -609,7 +610,7 @@ Now, we set some options for tuning, launch the search tasks and evaluate the en
   you should be able to do the compilation (the section below).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 344-360
+.. GENERATED FROM PYTHON SOURCE LINES 350-366
 
 Other Tips
 ----------
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
index 017dcc93e..454a48fb3 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_network_x86.rst.txt
@@ -46,11 +46,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 47-56
+.. GENERATED FROM PYTHON SOURCE LINES 47-57
 
 .. code-block:: default
 
 
+
     import numpy as np
 
     import tvm
@@ -66,7 +67,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 57-69
+.. GENERATED FROM PYTHON SOURCE LINES 63-75
 
 Define a Network
 ----------------
@@ -81,7 +82,7 @@ We also implemented more optimizations for NHWC layout with the auto-scheduler.
 So it is recommended to convert your models to NHWC layout to use the auto-scheduler.
 You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
 
-.. GENERATED FROM PYTHON SOURCE LINES 69-157
+.. GENERATED FROM PYTHON SOURCE LINES 75-163
 
 .. code-block:: default
 
@@ -180,7 +181,7 @@ You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout co
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 158-167
+.. GENERATED FROM PYTHON SOURCE LINES 164-173
 
 Extract Search Tasks
 --------------------
@@ -192,7 +193,7 @@ as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
 latency of a task and :code:`weight[t]` is the weight of the task.
 The task scheduler will just optimize this objective.
 
-.. GENERATED FROM PYTHON SOURCE LINES 167-184
+.. GENERATED FROM PYTHON SOURCE LINES 173-190
 
 .. code-block:: default
 
@@ -487,7 +488,7 @@ The task scheduler will just optimize this objective.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 185-201
+.. GENERATED FROM PYTHON SOURCE LINES 191-207
 
 Begin Tuning
 ------------
@@ -506,7 +507,7 @@ Now, we set some options for tuning and launch the search tasks
   :any:`auto_scheduler.LocalRunner` for more parameters.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 201-235
+.. GENERATED FROM PYTHON SOURCE LINES 207-241
 
 .. code-block:: default
 
@@ -551,7 +552,7 @@ Now, we set some options for tuning and launch the search tasks
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 236-294
+.. GENERATED FROM PYTHON SOURCE LINES 242-300
 
 .. note:: Explain the printed information during tuning
 
@@ -612,7 +613,7 @@ Now, we set some options for tuning and launch the search tasks
   errors are isolated from the main process.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 296-302
+.. GENERATED FROM PYTHON SOURCE LINES 302-308
 
 .. note:: Terminate the tuning earlier
 
@@ -621,7 +622,7 @@ Now, we set some options for tuning and launch the search tasks
   you should be able to do the compilation (the section below).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 305-310
+.. GENERATED FROM PYTHON SOURCE LINES 311-316
 
 Compile and Evaluate
 --------------------
@@ -629,7 +630,7 @@ After auto-tuning, we can compile the network with the best schedules we found.
 All measurement records are dumped into the log file during auto-tuning,
 so we can read the log file and load the best schedules.
 
-.. GENERATED FROM PYTHON SOURCE LINES 310-328
+.. GENERATED FROM PYTHON SOURCE LINES 316-334
 
 .. code-block:: default
 
@@ -665,13 +666,13 @@ so we can read the log file and load the best schedules.
     Evaluate inference time cost...
     Execution time summary:
      mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
-      751.4052     751.8345     751.8703     750.5107      0.6326   
+      756.2779     756.2160     756.5215     756.0961      0.1791   
                
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 329-345
+.. GENERATED FROM PYTHON SOURCE LINES 335-351
 
 Other Tips
 ----------
@@ -693,7 +694,7 @@ Other Tips
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  20.647 seconds)
+   **Total running time of the script:** ( 1 minutes  21.651 seconds)
 
 
 .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_x86.py:
diff --git a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
index 22c21d210..45b067ad5 100644
--- a/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autoscheduler/tune_sparse_x86.rst.txt
@@ -37,11 +37,12 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in a :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 37-48
+.. GENERATED FROM PYTHON SOURCE LINES 37-49
 
 .. code-block:: default
 
 
+
     import os
 
     import numpy as np
@@ -59,7 +60,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 49-54
+.. GENERATED FROM PYTHON SOURCE LINES 55-60
 
 Define the computation
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -67,7 +68,7 @@ To begin with, let us define the computation of a sparse matmul with several rel
 The function should return the list of input/output tensors.
 From these tensors, the auto-scheduler can get the whole computational graph.
 
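 A hedged sketch of such a workload function (the name :code:`sparse_dense`
 and the exact op sequence are illustrative, not the verbatim tutorial code):

 .. code-block:: default

     import tvm
     from tvm import te, topi, auto_scheduler

     @auto_scheduler.register_workload
     def sparse_dense(M, N, K, w_data_shape, w_indices_shape, w_indptr_shape, dtype):
         # Dense input, BSR-encoded sparse weight, and a bias.
         X = te.placeholder(shape=(M, K), dtype=dtype)
         W_data = te.placeholder(shape=w_data_shape, dtype=dtype)
         W_indices = te.placeholder(shape=w_indices_shape, dtype="int32")
         W_indptr = te.placeholder(shape=w_indptr_shape, dtype="int32")
         B = te.placeholder(shape=(M, N), dtype=dtype)

         out = topi.nn.sparse_dense(X, W_data, W_indices, W_indptr)
         out = te.compute((M, N), lambda i, j: out[i, j] + B[i, j], name="BiasAdd")
         out = topi.nn.relu(out)
         # Returning every tensor lets the auto-scheduler recover the graph.
         return [X, W_data, W_indices, W_indptr, B, out]
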
-.. GENERATED FROM PYTHON SOURCE LINES 54-71
+.. GENERATED FROM PYTHON SOURCE LINES 60-77
 
 .. code-block:: default
 
@@ -95,7 +96,7 @@ From these tensors, the auto-scheduler can get the whole computational graph.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 72-81
+.. GENERATED FROM PYTHON SOURCE LINES 78-87
 
 Special step for sparse workload
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -107,7 +108,7 @@ To solve this problem, we register these as special buffers, and load them when
 measuring.
 See `tvm.auto_scheduler.measure.py` for more details.
 
-.. GENERATED FROM PYTHON SOURCE LINES 81-100
+.. GENERATED FROM PYTHON SOURCE LINES 87-106
 
 .. code-block:: default
 
@@ -137,7 +138,7 @@ See the `tvm.auto_scheduler.measure.py` for more details.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 101-108
+.. GENERATED FROM PYTHON SOURCE LINES 107-114
 
 Create the search task
 ^^^^^^^^^^^^^^^^^^^^^^
 If your machine supports AVX instructions, you can
   - replace "llvm" below with "llvm -mcpu=core-avx2" to enable AVX2
   - replace "llvm" below with "llvm -mcpu=skylake-avx512" to enable AVX-512
 
-.. GENERATED FROM PYTHON SOURCE LINES 108-136
+.. GENERATED FROM PYTHON SOURCE LINES 114-142
 
 .. code-block:: default
 
@@ -203,7 +204,7 @@ If your machine supports avx instructions, you can
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 137-147
+.. GENERATED FROM PYTHON SOURCE LINES 143-153
 
 Write the custom sketch for sparse dense op
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -216,7 +217,7 @@ CustomSketchRule consists of two parts: the condition function and the apply fun
   - apply function: describes how to generate the initial sketch. You can implement it using
     the loop state APIs provided by the auto-scheduler, as sketched below.
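
 An illustrative skeleton of the two functions (the op tag and return codes follow
 the upstream tutorial; the full apply function, elided here, builds the multi-level
 tiling with the loop state APIs):

 .. code-block:: python

     from tvm import auto_scheduler

     def meet_condition_func(search_policy, state, stage_id):
         # Apply our rule only to the sparse_dense stage; pass on everything else.
         state = auto_scheduler.loop_state.State(
             state, search_policy.search_task.compute_dag
         )
         if state.stages[stage_id].op.tag == "sparse_dense_sp_rhs_bsrmm":
             return auto_scheduler.PreloadCustomSketchRule.APPLY_AND_SKIP_REST
         return auto_scheduler.PreloadCustomSketchRule.PASS

     def apply_func(search_policy, state, stage_id):
         # Build the initial sketch here with the loop state APIs
         # (tiling, fusing, reordering); the real body is elided.
         ...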
 
-.. GENERATED FROM PYTHON SOURCE LINES 147-202
+.. GENERATED FROM PYTHON SOURCE LINES 153-208
 
 .. code-block:: default
 
@@ -282,7 +283,7 @@ CustomSketchRule consists of two parts: the condition function and the apply fun
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 203-215
+.. GENERATED FROM PYTHON SOURCE LINES 209-221
 
 Next, we set parameters for the auto-scheduler with the custom sketch plugged in.
 
@@ -297,7 +298,7 @@ Next, we set parameters for the auto-scheduler with the custom sketch plugged in
 * Here, we need to create an :code:`auto_scheduler.SketchPolicy` object and add the custom sketch
   rule as an `init_search_callbacks` entry.
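
 A minimal sketch of wiring the rule in, assuming the ``meet_condition_func``/``apply_func``
 pair sketched above:

 .. code-block:: python

     search_policy = auto_scheduler.SketchPolicy(
         task,
         program_cost_model=auto_scheduler.XGBModel(),
         init_search_callbacks=[
             auto_scheduler.PreloadCustomSketchRule(
                 meet_condition_func, apply_func, "SparseDense"
             )
         ],
     )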
 
-.. GENERATED FROM PYTHON SOURCE LINES 215-231
+.. GENERATED FROM PYTHON SOURCE LINES 221-237
 
 .. code-block:: default
 
@@ -324,7 +325,7 @@ Next, we set parameters for the auto-scheduler with the custom sketch plugged in
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 232-238
+.. GENERATED FROM PYTHON SOURCE LINES 238-244
 
 Run the search
 ^^^^^^^^^^^^^^
@@ -333,7 +334,7 @@ We can kick off the search and let the auto-scheduler do its magic.
 After some measurement trials, we can load the best schedule from the log
 file and apply it.
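
 A rough sketch of this step, assuming ``task`` and ``search_policy`` from above;
 the log file name is illustrative:

 .. code-block:: python

     log_file = "sparse_dense.json"
     tune_option = auto_scheduler.TuningOptions(
         num_measure_trials=10,  # kept tiny for demonstration
         measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
         verbose=2,
     )
     task.tune(tune_option, search_policy=search_policy)
     # Load the best schedule found so far from the log.
     sch, args = task.apply_best(log_file)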
 
-.. GENERATED FROM PYTHON SOURCE LINES 238-247
+.. GENERATED FROM PYTHON SOURCE LINES 244-253
 
 .. code-block:: default
 
@@ -364,13 +365,13 @@ file and apply it.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 248-251
+.. GENERATED FROM PYTHON SOURCE LINES 254-257
 
 We can lower the schedule to see the IR after auto-scheduling.
 The auto-scheduler correctly performs optimizations including multi-level tiling,
 layout transformation, parallelization, vectorization, unrolling, and operator fusion.
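
 Assuming ``sch`` and ``args`` returned by ``apply_best`` above, the IR shown next
 can be printed with a one-liner along these lines:

 .. code-block:: python

     print(tvm.lower(sch, args, simple_mode=True))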
 
-.. GENERATED FROM PYTHON SOURCE LINES 251-255
+.. GENERATED FROM PYTHON SOURCE LINES 257-261
 
 .. code-block:: default
 
@@ -396,80 +397,77 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
                  placeholder_4: Buffer(placeholder_14: Pointer(float32), float32, [65536], []),
                  compute: Buffer(compute_2: Pointer(float32), float32, [65536], [])}
       buffer_map = {placeholder_5: placeholder, placeholder_6: placeholder_1, placeholder_7: placeholder_2, placeholder_8: placeholder_3, placeholder_9: placeholder_4, compute_1: compute}
-      preflattened_buffer_map = {placeholder_5: placeholder_15: Buffer(placeholder_10, float32, [128, 256], []), placeholder_9: placeholder_16: Buffer(placeholder_14, float32, [128, 512], []), placeholder_8: placeholder_17: Buffer(placeholder_13, int32, [33], []), placeholder_6: placeholder_18: Buffer(placeholder_11, float32, [4916, 16, 1], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_7: placeholder_19: Buffer(placeholder_12, int32, [4916], [])} {
-      for (i0.outer.i1.outer.fused: int32, 0, 16) "parallel" {
-        allocate(compute_4: Pointer(global float32), float32, [4096]), storage_scope = global {
-          for (i.outer.inner: int32, 0, 4) {
-            for (nb_j.inner: int32, 0, 2) {
-              for (i.inner.init: int32, 0, 32) {
-                let cse_var_1: int32 = (((i.outer.inner*1024) + (i.inner.init*32)) + (nb_j.inner*16))
-                 {
-                  compute_5: Buffer(compute_4, float32, [4096], [])[cse_var_1] = 0f32
-                  compute_5[(cse_var_1 + 1)] = 0f32
-                  compute_5[(cse_var_1 + 2)] = 0f32
-                  compute_5[(cse_var_1 + 3)] = 0f32
-                  compute_5[(cse_var_1 + 4)] = 0f32
-                  compute_5[(cse_var_1 + 5)] = 0f32
-                  compute_5[(cse_var_1 + 6)] = 0f32
-                  compute_5[(cse_var_1 + 7)] = 0f32
-                  compute_5[(cse_var_1 + 8)] = 0f32
-                  compute_5[(cse_var_1 + 9)] = 0f32
-                  compute_5[(cse_var_1 + 10)] = 0f32
-                  compute_5[(cse_var_1 + 11)] = 0f32
-                  compute_5[(cse_var_1 + 12)] = 0f32
-                  compute_5[(cse_var_1 + 13)] = 0f32
-                  compute_5[(cse_var_1 + 14)] = 0f32
-                  compute_5[(cse_var_1 + 15)] = 0f32
-                }
+      preflattened_buffer_map = {placeholder_6: placeholder_15: Buffer(placeholder_11, float32, [4916, 16, 1], []), placeholder_5: placeholder_16: Buffer(placeholder_10, float32, [128, 256], []), placeholder_8: placeholder_17: Buffer(placeholder_13, int32, [33], []), compute_1: compute_3: Buffer(compute_2, float32, [128, 512], []), placeholder_7: placeholder_18: Buffer(placeholder_12, int32, [4916], []), placeholder_9: placeholder_19: Buffer(placeholder_14, float32, [128, 512], [])} {
+      for (i0.outer: int32, 0, 16) "parallel" {
+        allocate(compute_4: Pointer(global float32), float32, [256]), storage_scope = global;
+        for (i1.outer: int32, 0, 16) {
+          for (nb_j.inner: int32, 0, 2) {
+            for (i.inner.init: int32, 0, 8) {
+              let cse_var_1: int32 = ((i.inner.init*32) + (nb_j.inner*16))
+               {
+                compute_5: Buffer(compute_4, float32, [256], [])[cse_var_1] = 0f32
+                compute_5[(cse_var_1 + 1)] = 0f32
+                compute_5[(cse_var_1 + 2)] = 0f32
+                compute_5[(cse_var_1 + 3)] = 0f32
+                compute_5[(cse_var_1 + 4)] = 0f32
+                compute_5[(cse_var_1 + 5)] = 0f32
+                compute_5[(cse_var_1 + 6)] = 0f32
+                compute_5[(cse_var_1 + 7)] = 0f32
+                compute_5[(cse_var_1 + 8)] = 0f32
+                compute_5[(cse_var_1 + 9)] = 0f32
+                compute_5[(cse_var_1 + 10)] = 0f32
+                compute_5[(cse_var_1 + 11)] = 0f32
+                compute_5[(cse_var_1 + 12)] = 0f32
+                compute_5[(cse_var_1 + 13)] = 0f32
+                compute_5[(cse_var_1 + 14)] = 0f32
+                compute_5[(cse_var_1 + 15)] = 0f32
               }
-              for (elem_idx: int32, 0, let cse_var_2: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
-                for (i.inner: int32, 0, 32) {
-                  let cse_var_21: int32 = (elem_idx*16)
-                  let cse_var_20: int32 = ((i0.outer.i1.outer.fused*2) + nb_j.inner)
-                  let cse_var_19: int32 = ((i.outer.inner*8192) + (i.inner*256))
-                  let cse_var_18: int32 = (((i.outer.inner*1024) + (i.inner*32)) + (nb_j.inner*16))
-                  let cse_var_17: int32 = (cse_var_18 + 9)
-                  let cse_var_16: int32 = (cse_var_18 + 8)
-                  let cse_var_15: int32 = (cse_var_18 + 7)
-                  let cse_var_14: int32 = (cse_var_18 + 6)
-                  let cse_var_13: int32 = (cse_var_18 + 5)
-                  let cse_var_12: int32 = (cse_var_18 + 4)
-                  let cse_var_11: int32 = (cse_var_18 + 3)
-                  let cse_var_10: int32 = (cse_var_18 + 2)
-                  let cse_var_9: int32 = (cse_var_18 + 15)
-                  let cse_var_8: int32 = (cse_var_18 + 14)
-                  let cse_var_7: int32 = (cse_var_18 + 13)
-                  let cse_var_6: int32 = (cse_var_18 + 12)
-                  let cse_var_5: int32 = (cse_var_18 + 11)
-                  let cse_var_4: int32 = (cse_var_18 + 10)
-                  let cse_var_3: int32 = (cse_var_18 + 1)
-                   {
-                    compute_5[cse_var_18] = (compute_5[cse_var_18] + (placeholder_1[((placeholder_3[cse_var_20]*16) + cse_var_21)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_3] = (compute_5[cse_var_3] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 1)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 2)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 3)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 4)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 5)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 6)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 7)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 8)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 9)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 10)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 11)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 12)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 13)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 14)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                    compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 15)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
-                  }
+            }
+            for (elem_idx: int32, 0, let cse_var_2: int32 = ((i1.outer*2) + nb_j.inner) in (placeholder_3[(cse_var_2 + 1)] - placeholder_3[cse_var_2])) {
+              for (i.inner: int32, 0, 8) {
+                let cse_var_21: int32 = (elem_idx*16)
+                let cse_var_20: int32 = ((i1.outer*2) + nb_j.inner)
+                let cse_var_19: int32 = ((i0.outer*2048) + (i.inner*256))
+                let cse_var_18: int32 = ((i.inner*32) + (nb_j.inner*16))
+                let cse_var_17: int32 = (cse_var_18 + 9)
+                let cse_var_16: int32 = (cse_var_18 + 8)
+                let cse_var_15: int32 = (cse_var_18 + 7)
+                let cse_var_14: int32 = (cse_var_18 + 6)
+                let cse_var_13: int32 = (cse_var_18 + 5)
+                let cse_var_12: int32 = (cse_var_18 + 4)
+                let cse_var_11: int32 = (cse_var_18 + 3)
+                let cse_var_10: int32 = (cse_var_18 + 2)
+                let cse_var_9: int32 = (cse_var_18 + 15)
+                let cse_var_8: int32 = (cse_var_18 + 14)
+                let cse_var_7: int32 = (cse_var_18 + 13)
+                let cse_var_6: int32 = (cse_var_18 + 12)
+                let cse_var_5: int32 = (cse_var_18 + 11)
+                let cse_var_4: int32 = (cse_var_18 + 10)
+                let cse_var_3: int32 = (cse_var_18 + 1)
+                 {
+                  compute_5[cse_var_18] = (compute_5[cse_var_18] + (placeholder_1[((placeholder_3[cse_var_20]*16) + cse_var_21)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_3] = (compute_5[cse_var_3] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 1)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_10] = (compute_5[cse_var_10] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 2)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_11] = (compute_5[cse_var_11] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 3)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_12] = (compute_5[cse_var_12] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 4)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_13] = (compute_5[cse_var_13] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 5)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_14] = (compute_5[cse_var_14] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 6)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_15] = (compute_5[cse_var_15] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 7)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_16] = (compute_5[cse_var_16] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 8)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_17] = (compute_5[cse_var_17] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 9)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_4] = (compute_5[cse_var_4] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 10)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_5] = (compute_5[cse_var_5] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 11)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_6] = (compute_5[cse_var_6] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 12)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_7] = (compute_5[cse_var_7] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 13)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_8] = (compute_5[cse_var_8] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 14)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
+                  compute_5[cse_var_9] = (compute_5[cse_var_9] + (placeholder_1[(((placeholder_3[cse_var_20]*16) + cse_var_21) + 15)]*max(placeholder[(cse_var_19 + placeholder_2[(placeholder_3[cse_var_20] + elem_idx)])], 0f32)))
                 }
               }
             }
           }
-          for (i0.inner: int32, 0, 128) {
-            for (i1.inner: int32, 0, 32) {
-              let cse_var_22: int32 = (((i0.inner*512) + (i0.outer.i1.outer.fused*32)) + i1.inner)
-              compute[cse_var_22] = max((compute_5[((i0.inner*32) + i1.inner)] + placeholder_4[cse_var_22]), 0f32)
-            }
+          for (i0.inner: int32, 0, 8) {
+            let cse_var_22: int32 = (((i0.outer*4096) + (i0.inner*512)) + (i1.outer*32))
+            compute[ramp(cse_var_22, 1, 32)] = max((compute_5[ramp((i0.inner*32), 1, 32)] + placeholder_4[ramp(cse_var_22, 1, 32)]), broadcast(0f32, 32))
           }
         }
       }
@@ -480,13 +478,13 @@ layout transformation, parallelization, vectorization, unrolling, and operator f
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 256-259
+.. GENERATED FROM PYTHON SOURCE LINES 262-265
 
 Check correctness and evaluate performance
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 We build the binary and check its correctness and performance.
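
 A minimal sketch of this step, assuming ``sch``, ``args``, and ``target`` from above
 (construction of the test inputs and the numpy reference check are elided):

 .. code-block:: python

     func = tvm.build(sch, args, target)
     dev = tvm.cpu()
     # ...wrap the numpy test arrays in tvm.nd.array, run func once, and
     # compare against a scipy/numpy reference before timing...
     evaluator = func.time_evaluator(func.entry_name, dev, repeat=10, min_repeat_ms=500)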
 
-.. GENERATED FROM PYTHON SOURCE LINES 259-286
+.. GENERATED FROM PYTHON SOURCE LINES 265-292
 
 .. code-block:: default
 
@@ -525,12 +523,12 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 1.703 ms
+    Execution time of this operator: 1.826 ms
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 287-325
+.. GENERATED FROM PYTHON SOURCE LINES 293-331
 
 .. note:: Tuning result example
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
index b8a5846bd..6e7009850 100644
--- a/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/sg_execution_times.rst.txt
@@ -5,16 +5,16 @@
 
 Computation times
 =================
-**00:44.129** total execution time for **how_to_tune_with_autotvm** files:
+**00:44.740** total execution time for **how_to_tune_with_autotvm** files:
 
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:44.097 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)           | 00:44.708 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.019 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)               | 00:00.017 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)             | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)               | 00:00.004 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)               | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``) | 00:00.004 | 0.0 MB |
+| :ref:`sphx_glr_how_to_tune_with_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``) | 00:00.005 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------+-----------+--------+
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
index 1d5ce20cd..9fdb8e982 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_conv2d_cuda.rst.txt
@@ -30,7 +30,20 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in an :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 32-50
+.. GENERATED FROM PYTHON SOURCE LINES 30-32
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 38-56
 
 Install dependencies
 --------------------
@@ -51,7 +64,7 @@ as FFI of tvm. In the root directory of tvm, execute
 
 Now return to Python code. Import packages.
 
-.. GENERATED FROM PYTHON SOURCE LINES 50-62
+.. GENERATED FROM PYTHON SOURCE LINES 56-68
 
 .. code-block:: default
 
@@ -74,7 +87,7 @@ Now return to python code. Import packages.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 63-85
+.. GENERATED FROM PYTHON SOURCE LINES 69-91
 
 Step 1:  Define the search space
 --------------------------------
@@ -99,7 +112,7 @@ It is worth noting that the search space for a conv2d operator
 can be very large (at the level of 10^9 for some input shapes).
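
 An abridged sketch of what such a template can look like (the template name and
 knob set are illustrative; the tutorial's full template, elided from this diff,
 defines many more knobs):

 .. code-block:: python

     from tvm import autotvm, te, topi

     @autotvm.template("tutorial/conv2d_sketch")  # hypothetical template name
     def conv2d_sketch(N, H, W, CO, CI, KH, KW, stride, padding):
         data = te.placeholder((N, CI, H, W), name="data")
         kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
         conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1)
         s = te.create_schedule([conv.op])
         cfg = autotvm.get_config()
         n, f, y, x = s[conv].op.axis
         # Every knob multiplies the size of the search space.
         cfg.define_split("tile_f", f, num_outputs=4)
         cfg.define_split("tile_y", y, num_outputs=4)
         cfg.define_split("tile_x", x, num_outputs=4)
         # ...apply the chosen factors to the schedule here...
         return s, [data, kernel, conv]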
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 85-175
+.. GENERATED FROM PYTHON SOURCE LINES 91-181
 
 .. code-block:: default
 
@@ -200,7 +213,7 @@ can be very large (at the level of 10^9 for some input shapes)
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 176-183
+.. GENERATED FROM PYTHON SOURCE LINES 182-189
 
 Step 2:  Search through the space
 ---------------------------------
@@ -210,7 +223,7 @@ for our case. Here we only do 20 trials for demonstration.
 In practice, running 1,000 trials can usually find some good kernels
 for this template.
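
 A rough sketch of this step, assuming the conv2d ``task`` and a ``measure_option``
 configured as in the tutorial; the log file name is illustrative:

 .. code-block:: python

     from tvm import autotvm

     tuner = autotvm.tuner.XGBTuner(task)
     tuner.tune(
         n_trial=20,  # 20 for demonstration; ~1000 in practice
         measure_option=measure_option,
         callbacks=[autotvm.callback.log_to_file("conv2d.log")],
     )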
 
-.. GENERATED FROM PYTHON SOURCE LINES 183-212
+.. GENERATED FROM PYTHON SOURCE LINES 189-218
 
 .. code-block:: default
 
@@ -879,8 +892,8 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 4, 32]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2885496
-    No: 6   GFLOPS: 63.15/63.15     result: MeasureResult(costs=(0.003665846,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6695268154144287, timestamp=1656607512.7788217)        [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
-    No: 7   GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 6   GFLOPS: 104.01/104.01   result: MeasureResult(costs=(0.0022258615,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.6293339729309082, timestamp=1656614191.1079452)       [('tile_f', [-1, 1, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3754080
+    No: 7   GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1003,7 +1016,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 16, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 256, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6225319
-    No: 8   GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 8   GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1126,7 +1139,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 32]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 8, 64]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,943546
-    No: 9   GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 9   GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1249,7 +1262,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 4, 16, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 16, 32]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2868708
-    No: 10  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 10  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 142, in build
         res = future.result()
       File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
@@ -1267,7 +1280,7 @@ for this template
     TimeoutError
 
             [('tile_f', [-1, 32, 2, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 4, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4691833
-    No: 11  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 11  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1390,7 +1403,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 2, 64]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1042124
-    No: 12  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 12  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1513,7 +1526,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 32, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 32, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10013405
-    No: 13  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 13  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1636,7 +1649,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 8, 8, 2]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 4, 32]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6732082
-    No: 14  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 14  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1759,7 +1772,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 4, 32]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 128]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],None,7536735
-    No: 15  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 15  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -1882,7 +1895,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 4]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 128, 4]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,482121
-    No: 16  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 16  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -2005,7 +2018,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 2, 1, 16]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 32, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2824525
-    No: 17  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 17  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -2128,7 +2141,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 64, 1, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 8, 8]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4559286
-    No: 18  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 18  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 588, in __call__
         func, arg_info = _build_func_common(measure_input, self.runtime, **kwargs)
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 540, in _build_func_common
@@ -2251,7 +2264,7 @@ for this template
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 871, in verify_pass
         raise InstantiationError("Skipped because of invalid gpu kernel")
     tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel        [('tile_f', [-1, 1, 32, 16]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 512]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9677544
-    No: 19  GFLOPS: 0.00/63.15      result: Traceback (most recent call last):
+    No: 19  GFLOPS: 0.00/104.01     result: Traceback (most recent call last):
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 738, in __call__
         yield remote, remote.load_module(os.path.split(build_result.filename)[1])
       File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 702, in run_through_rpc
@@ -2339,7 +2352,7 @@ for this template
       15: _PyEval_EvalFrameDefault
       14: 0x0000000000537c30
       13: _PyObject_FastCallKeywords
-      12: 0x00007fd908292fa2
+      12: 0x00007f2733083fa2
       11: _ctypes_callproc
       10: ffi_call
       9: ffi_call_unix64
@@ -2404,17 +2417,17 @@ for this template
       21: _PyFunction_FastCallKeywords
       20: _PyEval_EvalFrameDefault
       19: _PyFunction_FastCall      [('tile_f', [-1, 8, 2, 16]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6390073
-    No: 20  GFLOPS: 144.84/144.84   result: MeasureResult(costs=(0.00159831941,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4561679363250732, timestamp=1656607539.5023582)      [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
+    No: 20  GFLOPS: 142.92/142.92   result: MeasureResult(costs=(0.00161977549,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4490108489990234, timestamp=1656614217.0833886)      [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 213-215
+.. GENERATED FROM PYTHON SOURCE LINES 219-221
 
 Finally, we can inspect the best config from the log file, check correctness,
 and measure running time.
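
 A minimal sketch, assuming ``task`` and the log file written during tuning:

 .. code-block:: python

     dispatch_context = autotvm.apply_history_best("conv2d.log")
     best_config = dispatch_context.query(task.target, task.workload)
     print("Best config:", best_config)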
 
-.. GENERATED FROM PYTHON SOURCE LINES 215-245
+.. GENERATED FROM PYTHON SOURCE LINES 221-251
 
 .. code-block:: default
 
@@ -2461,7 +2474,7 @@ and measure running time.
     Best config:
     [('tile_f', [-1, 1, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,9881539
     Finish loading 20 records
-    Time cost of this operator: 0.001975
+    Time cost of this operator: 0.002030
 
 
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt
index ac984824f..5b5e210df 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_arm.rst.txt
@@ -43,7 +43,20 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in an :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 45-64
+.. GENERATED FROM PYTHON SOURCE LINES 43-45
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 51-70
 
 Install dependencies
 --------------------
@@ -65,7 +78,7 @@ as FFI of TVM. In the root directory of TVM, execute
 
 Now return to Python code. Import packages.
 
-.. GENERATED FROM PYTHON SOURCE LINES 64-75
+.. GENERATED FROM PYTHON SOURCE LINES 70-81
 
 .. code-block:: default
 
@@ -87,7 +100,7 @@ Now return to python code. Import packages.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 76-81
+.. GENERATED FROM PYTHON SOURCE LINES 82-87
 
 Define network
 --------------
@@ -95,7 +108,7 @@ First we need to define the network in relay frontend API.
 We can load a pre-defined network from :code:`relay.testing`.
 We can also load models from MXNet, ONNX, and TensorFlow.
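
 For example, a minimal sketch of loading a pre-defined resnet-18 (batch size and
 dtype are illustrative):

 .. code-block:: python

     from tvm import relay
     from tvm.relay import testing

     mod, params = testing.resnet.get_workload(
         num_layers=18, batch_size=1, dtype="float32"
     )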
 
-.. GENERATED FROM PYTHON SOURCE LINES 81-124
+.. GENERATED FROM PYTHON SOURCE LINES 87-130
 
 .. code-block:: default
 
@@ -149,7 +162,7 @@ We can also load models from MXNet, ONNX and TensorFlow.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-149
+.. GENERATED FROM PYTHON SOURCE LINES 131-155
 
 Start RPC Tracker
 -----------------
@@ -176,7 +189,7 @@ The expected output is
 
   INFO:RPCTracker:bind to 0.0.0.0:9190
 
-.. GENERATED FROM PYTHON SOURCE LINES 151-193
+.. GENERATED FROM PYTHON SOURCE LINES 157-199
 
 Register Devices to RPC Tracker
 -----------------------------------
@@ -221,7 +234,7 @@ the output can be
 
 You can register multiple devices to the tracker to accelerate the measurement in tuning.
 
-.. GENERATED FROM PYTHON SOURCE LINES 195-200
+.. GENERATED FROM PYTHON SOURCE LINES 201-206
 
 Set Tuning Options
 ------------------
@@ -229,7 +242,7 @@ Before tuning, we should apply some configurations. Here I use an RK3399 board
 as an example. In your setting, you should modify the target and device_key accordingly.
 Set :code:`use_android` to True if you use an Android phone.
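
 A condensed sketch of such a configuration (log file name and device key are
 illustrative and must match your own setup):

 .. code-block:: python

     from tvm import autotvm

     device_key = "rk3399"  # must match the key registered with the RPC tracker
     use_android = False    # set True when the board runs Android
     tuning_option = {
         "log_filename": "%s.resnet18.log" % device_key,
         "tuner": "xgb",
         "n_trial": 1500,
         "early_stopping": 800,
         "measure_option": autotvm.measure_option(
             builder=autotvm.LocalBuilder(
                 build_func="ndk" if use_android else "default"
             ),
             runner=autotvm.RPCRunner(
                 device_key, host="127.0.0.1", port=9190,
                 number=5, timeout=10,
             ),
         ),
     }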
 
-.. GENERATED FROM PYTHON SOURCE LINES 200-235
+.. GENERATED FROM PYTHON SOURCE LINES 206-241
 
 .. code-block:: default
 
@@ -275,7 +288,7 @@ set :code:`use_android` to True if you use android phone.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 236-248
+.. GENERATED FROM PYTHON SOURCE LINES 242-254
 
 .. note:: How to set tuning options
 
@@ -290,7 +303,7 @@ set :code:`use_android` to True if you use android phone.
   optimization in general. For example, on an ARM A53 CPU at 2.0 GHz, we find it can boost the
   performance of depthwise convolution on the MobileNet V1 model by 1.6x.
 
-.. GENERATED FROM PYTHON SOURCE LINES 251-257
+.. GENERATED FROM PYTHON SOURCE LINES 257-263
 
 Begin Tuning
 ------------
@@ -299,7 +312,7 @@ Here, we provide a simple utility function to tune a list of tasks.
 This function is just an initial implementation which tunes them in sequential order.
 We will introduce a more sophisticated tuning scheduler in the future.
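
 A rough sketch of such a utility, assuming ``tasks`` extracted from the network
 and a prepared ``measure_option``:

 .. code-block:: python

     from tvm import autotvm
     from tvm.autotvm.tuner import XGBTuner

     def tune_tasks(tasks, measure_option, n_trial=1000, log_filename="tuning.log"):
         for i, task in enumerate(reversed(tasks)):
             prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
             tuner = XGBTuner(task, loss_type="rank")
             trials = min(n_trial, len(task.config_space))
             tuner.tune(
                 n_trial=trials,
                 measure_option=measure_option,
                 callbacks=[
                     autotvm.callback.progress_bar(trials, prefix=prefix),
                     autotvm.callback.log_to_file(log_filename),
                 ],
             )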
 
-.. GENERATED FROM PYTHON SOURCE LINES 257-315
+.. GENERATED FROM PYTHON SOURCE LINES 263-321
 
 .. code-block:: default
 
@@ -368,11 +381,11 @@ We will introduce a more sophisticated tuning scheduler in the future.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 316-317
+.. GENERATED FROM PYTHON SOURCE LINES 322-323
 
 Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 317-370
+.. GENERATED FROM PYTHON SOURCE LINES 323-376
 
 .. code-block:: default
 
@@ -436,7 +449,7 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 371-398
+.. GENERATED FROM PYTHON SOURCE LINES 377-404
 
 Sample Output
 -------------
@@ -466,7 +479,7 @@ It takes about 2 hours on a 32T AMD Ryzen Threadripper.
    Evaluate inference time cost...
    Mean inference time (std dev): 162.59 ms (0.06 ms)
 
-.. GENERATED FROM PYTHON SOURCE LINES 400-416
+.. GENERATED FROM PYTHON SOURCE LINES 406-422
 
 .. note:: **Experiencing Difficulties?**
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt
index ee7072315..86fe81848 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_cuda.rst.txt
@@ -41,7 +41,20 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in an :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 43-61
+.. GENERATED FROM PYTHON SOURCE LINES 41-43
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 49-67
 
 Install dependencies
 --------------------
@@ -62,7 +75,7 @@ as FFI of tvm. In the root directory of tvm, execute:
 
 Now return to Python code. Import packages.
 
-.. GENERATED FROM PYTHON SOURCE LINES 61-72
+.. GENERATED FROM PYTHON SOURCE LINES 67-78
 
 .. code-block:: default
 
@@ -84,7 +97,7 @@ Now return to python code. Import packages.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 73-78
+.. GENERATED FROM PYTHON SOURCE LINES 79-84
 
 Define Network
 --------------
@@ -92,7 +105,7 @@ First we need to define the network in relay frontend API.
 We can load a pre-defined network from :code:`tvm.relay.testing`.
 We can also load models from MXNet, ONNX, and TensorFlow.
 
-.. GENERATED FROM PYTHON SOURCE LINES 78-121
+.. GENERATED FROM PYTHON SOURCE LINES 84-127
 
 .. code-block:: default
 
@@ -146,13 +159,13 @@ We can also load models from MXNet, ONNX and TensorFlow.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 122-125
+.. GENERATED FROM PYTHON SOURCE LINES 128-131
 
 Set Tuning Options
 ------------------
 Before tuning, we apply some configurations.
 
-.. GENERATED FROM PYTHON SOURCE LINES 125-145
+.. GENERATED FROM PYTHON SOURCE LINES 131-151
 
 .. code-block:: default
 
@@ -190,7 +203,7 @@ Before tuning, we apply some configurations.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 146-156
+.. GENERATED FROM PYTHON SOURCE LINES 152-162
 
 .. note:: How to set tuning options
 
@@ -203,7 +216,7 @@ Before tuning, we apply some configurations.
   accelerate the tuning process (see the `Scale up measurement` section below).
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 159-165
+.. GENERATED FROM PYTHON SOURCE LINES 165-171
 
 Begin Tuning
 ------------
@@ -212,7 +225,7 @@ Here, we provide a simple utility function to tune a list of tasks.
 This function is just an initial implementation which tunes them in sequential order.
 We will introduce a more sophisticated tuning scheduler in the future.
 
-.. GENERATED FROM PYTHON SOURCE LINES 165-217
+.. GENERATED FROM PYTHON SOURCE LINES 171-223
 
 .. code-block:: default
 
@@ -275,11 +288,11 @@ We will introduce a more sophisticated tuning scheduler in the future.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 218-219
+.. GENERATED FROM PYTHON SOURCE LINES 224-225
 
 Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 219-255
+.. GENERATED FROM PYTHON SOURCE LINES 225-261
 
 .. code-block:: default
 
@@ -326,7 +339,7 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 256-289
+.. GENERATED FROM PYTHON SOURCE LINES 262-295
 
 Sample Output
 -------------
@@ -362,7 +375,7 @@ The tuning target is NVIDIA 1080 Ti.
 
 As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30 ms, so we are a little faster.
 
-.. GENERATED FROM PYTHON SOURCE LINES 291-307
+.. GENERATED FROM PYTHON SOURCE LINES 297-313
 
 .. note:: **Experiencing Difficulties?**
 
@@ -381,11 +394,11 @@ As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30m
 
   Finally, always feel free to ask our community for help on https://discuss.tvm.apache.org
 
-.. GENERATED FROM PYTHON SOURCE LINES 310-311
+.. GENERATED FROM PYTHON SOURCE LINES 316-317
 
 .. _tutorials-autotvm-scale-up-rpc-tracker:
 
-.. GENERATED FROM PYTHON SOURCE LINES 313-366
+.. GENERATED FROM PYTHON SOURCE LINES 319-372
 
 Scale up measurement by using multiple devices
 ----------------------------------------------
@@ -441,7 +454,7 @@ For example, if we have four 1080ti, two titanx and one gfx900, the output can b
 Finally, we need to change the tuning option to use RPCRunner. Use the code below
 to replace the corresponding part above.
 
-.. GENERATED FROM PYTHON SOURCE LINES 366-385
+.. GENERATED FROM PYTHON SOURCE LINES 372-391
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt
index 1d1e44d31..b8f655d45 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_mobile_gpu.rst.txt
@@ -41,7 +41,20 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in an :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 43-62
+.. GENERATED FROM PYTHON SOURCE LINES 41-43
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 49-68
 
 Install dependencies
 --------------------
@@ -63,7 +76,7 @@ as FFI of tvm. In the root directory of tvm, execute
 
 Now return to Python code. Import packages.
 
-.. GENERATED FROM PYTHON SOURCE LINES 62-74
+.. GENERATED FROM PYTHON SOURCE LINES 68-80
 
 .. code-block:: default
 
@@ -86,7 +99,7 @@ Now return to python code. Import packages.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 75-80
+.. GENERATED FROM PYTHON SOURCE LINES 81-86
 
 Define network
 --------------
@@ -94,7 +107,7 @@ First we need to define the network in relay frontend API.
 We can load a pre-defined network from :code:`relay.testing`.
 We can also load models from MXNet, ONNX, and TensorFlow.
 
-.. GENERATED FROM PYTHON SOURCE LINES 80-123
+.. GENERATED FROM PYTHON SOURCE LINES 86-129
 
 .. code-block:: default
 
@@ -148,11 +161,11 @@ We can also load models from MXNet, ONNX and TensorFlow.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-125
+.. GENERATED FROM PYTHON SOURCE LINES 130-131
 
 .. _tutorials-autotvm-start-rpc-tracker:
 
-.. GENERATED FROM PYTHON SOURCE LINES 127-151
+.. GENERATED FROM PYTHON SOURCE LINES 133-157
 
 Start RPC Tracker
 -----------------
@@ -179,7 +192,7 @@ The expected output is
 
   INFO:RPCTracker:bind to 0.0.0.0:9190
 
-.. GENERATED FROM PYTHON SOURCE LINES 153-195
+.. GENERATED FROM PYTHON SOURCE LINES 159-201
 
 Register Devices to RPC Tracker
 -----------------------------------
@@ -224,7 +237,7 @@ the output can be
 
 You can register multiple devices to the tracker to accelerate the measurement in tuning.
 
-.. GENERATED FROM PYTHON SOURCE LINES 197-202
+.. GENERATED FROM PYTHON SOURCE LINES 203-208
 
 Set Tuning Options
 ------------------
@@ -232,7 +245,7 @@ Before tuning, we should apply some configurations. Here I use an RK3399 board
 as an example. In your setting, you should modify the target and device_key accordingly.
 Set :code:`use_android` to True if you use an Android phone.
 
-.. GENERATED FROM PYTHON SOURCE LINES 202-236
+.. GENERATED FROM PYTHON SOURCE LINES 208-242
 
 .. code-block:: default
 
@@ -277,7 +290,7 @@ set :code:`use_android` to True if you use android phone.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 237-245
+.. GENERATED FROM PYTHON SOURCE LINES 243-251
 
 .. note:: How to set tuning options
 
@@ -288,7 +301,7 @@ set :code:`use_android` to True if you use android phone.
   set timeout larger.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 248-254
+.. GENERATED FROM PYTHON SOURCE LINES 254-260
 
 Begin Tuning
 ------------
@@ -297,7 +310,7 @@ Here, we provide a simple utility function to tune a list of tasks.
 This function is just an initial implementation which tunes them in sequential order.
 We will introduce a more sophisticated tuning scheduler in the future.
 
-.. GENERATED FROM PYTHON SOURCE LINES 254-306
+.. GENERATED FROM PYTHON SOURCE LINES 260-312
 
 .. code-block:: default
 
@@ -360,11 +373,11 @@ We will introduce a more sophisticated tuning scheduler in the future.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 307-308
+.. GENERATED FROM PYTHON SOURCE LINES 313-314
 
 Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 308-363
+.. GENERATED FROM PYTHON SOURCE LINES 314-369
 
 .. code-block:: default
 
@@ -430,7 +443,7 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 364-396
+.. GENERATED FROM PYTHON SOURCE LINES 370-402
 
 Sample Output
 -------------
@@ -465,7 +478,7 @@ One sample output is listed below. It takes about 3 hours on a 32T AMD Ryzen Thr
    Mean inference time (std dev): 128.05 ms (7.74 ms)
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 398-414
+.. GENERATED FROM PYTHON SOURCE LINES 404-420
 
 .. note:: **Experiencing Difficulties?**
 
diff --git a/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt b/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt
index ef1d98832..951dbaf92 100644
--- a/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt
+++ b/docs/_sources/how_to/tune_with_autotvm/tune_relay_x86.rst.txt
@@ -31,10 +31,11 @@ Note that this tutorial will not run on Windows or recent versions of macOS. To
 get it to run, you will need to wrap the body of this tutorial in an :code:`if
 __name__ == "__main__":` block.
 
-.. GENERATED FROM PYTHON SOURCE LINES 31-41
+.. GENERATED FROM PYTHON SOURCE LINES 31-42
 
 .. code-block:: default
 
+
     import os
     import numpy as np
 
@@ -52,7 +53,7 @@ __name__ == "__main__":` block.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 42-50
+.. GENERATED FROM PYTHON SOURCE LINES 48-56
 
 Define network
 --------------
@@ -63,7 +64,7 @@ We can also load models from MXNet, ONNX and TensorFlow.
 
 In this tutorial, we choose resnet-18 as the tuning example.
 
-.. GENERATED FROM PYTHON SOURCE LINES 50-116
+.. GENERATED FROM PYTHON SOURCE LINES 56-122
 
 .. code-block:: default
 
@@ -140,7 +141,7 @@ In this tutorial, we choose resnet-18 as tuning example.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 117-133
+.. GENERATED FROM PYTHON SOURCE LINES 123-139
 
 Configure tensor tuning settings and create tasks
 -------------------------------------------------
@@ -159,7 +160,7 @@ times and use the average of results. In addition, we need to flush the cache
 for the weight tensors between repeated measurements. This can make the measured
 latency of one operator closer to its actual latency during end-to-end inference.
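
 A minimal sketch of the task-extraction step, assuming ``mod``, ``params``, and
 ``target`` from the network definition above:

 .. code-block:: python

     from tvm import autotvm, relay

     tasks = autotvm.task.extract_from_program(
         mod["main"], target=target, params=params,
         ops=(relay.op.get("nn.conv2d"),),
     )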
 
-.. GENERATED FROM PYTHON SOURCE LINES 133-193
+.. GENERATED FROM PYTHON SOURCE LINES 139-199
 
 .. code-block:: default
 
@@ -230,11 +231,11 @@ latency of one operator closer to its actual latency during end-to-end inference
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 194-195
+.. GENERATED FROM PYTHON SOURCE LINES 200-201
 
 Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
-.. GENERATED FROM PYTHON SOURCE LINES 195-250
+.. GENERATED FROM PYTHON SOURCE LINES 201-256
 
 .. code-block:: default
 
@@ -300,7 +301,7 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 251-299
+.. GENERATED FROM PYTHON SOURCE LINES 257-305
 
 Sample Output
 -------------
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
index 095cc8432..dfcab19ca 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_autotune.rst.txt
@@ -28,11 +28,12 @@ Autotuning with microTVM
 
 This tutorial explains how to autotune a model using the C runtime.
 
-.. GENERATED FROM PYTHON SOURCE LINES 29-40
+.. GENERATED FROM PYTHON SOURCE LINES 29-41
 
 .. code-block:: default
 
 
+
     import os
     import json
     import numpy as np
@@ -50,7 +51,7 @@ This tutorial explains how to autotune a model using the C runtime.
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 41-47
+.. GENERATED FROM PYTHON SOURCE LINES 47-53
 
 Defining the model
 ###################
@@ -59,7 +60,7 @@ Defining the model
  fill parameters with random numbers.
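
 An illustrative stand-in for such a model (the shapes here are arbitrary, not the
 tutorial's exact ones):

 .. code-block:: python

     import numpy as np
     import tvm
     from tvm import relay

     data = relay.var("data", relay.TensorType((1, 3, 10, 10), "float32"))
     weight = relay.var("weight", relay.TensorType((6, 3, 5, 5), "float32"))
     out = relay.nn.conv2d(data, weight, padding=(2, 2), kernel_size=(5, 5))
     mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
     # Fill the parameter with random numbers, as described above.
     params = {"weight": tvm.nd.array(np.random.rand(6, 3, 5, 5).astype("float32"))}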
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 47-72
+.. GENERATED FROM PYTHON SOURCE LINES 53-78
 
 .. code-block:: default
 
@@ -95,7 +96,7 @@ Defining the model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 73-84
+.. GENERATED FROM PYTHON SOURCE LINES 79-90
 
 Defining the target
 ######################
@@ -109,7 +110,7 @@ Defining the target
  this tutorial.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 84-101
+.. GENERATED FROM PYTHON SOURCE LINES 90-107
 
 .. code-block:: default
 
@@ -137,7 +138,7 @@ Defining the target
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 102-111
+.. GENERATED FROM PYTHON SOURCE LINES 108-117
 
 Extracting tuning tasks
 ########################
@@ -149,7 +150,7 @@ Extracting tuning tasks
  transformation passes; we'll apply the same configuration later on during autotuning.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 111-117
+.. GENERATED FROM PYTHON SOURCE LINES 117-123
 
 .. code-block:: default
 
@@ -166,7 +167,7 @@ Extracting tuning tasks
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 118-128
+.. GENERATED FROM PYTHON SOURCE LINES 124-134
 
 Configuring microTVM
 #####################
@@ -179,7 +180,7 @@ Configuring microTVM
  choose other options from the `PLATFORM` list.
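
 One plausible shape of this configuration, sketched for the ``crt`` template
 project (the exact project options depend on the platform you pick):

 .. code-block:: python

     import pathlib
     import tvm

     # Candidate kernels are built into a microTVM template project and run
     # through its transport during measurement.
     module_loader = tvm.micro.AutoTvmModuleLoader(
         template_project_dir=pathlib.Path(
             tvm.micro.get_microtvm_template_projects("crt")
         ),
         project_options={"verbose": False},
     )
     builder = tvm.autotvm.LocalBuilder(
         build_kwargs={"build_option": {"tir.disable_vectorize": True}},
         do_fork=True,
         build_func=tvm.micro.autotvm_build_func,
         runtime=RUNTIME,
     )
     runner = tvm.autotvm.LocalRunner(
         number=1, repeat=1, timeout=100, module_loader=module_loader
     )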
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 128-166
+.. GENERATED FROM PYTHON SOURCE LINES 134-172
 
 .. code-block:: default
 
@@ -228,14 +229,14 @@ Configuring microTVM
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 167-171
+.. GENERATED FROM PYTHON SOURCE LINES 173-177
 
 Run Autotuning
 #########################
  Now we can run autotuning separately on each extracted task on the microTVM device.
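
 As a sketch, looping over the extracted tasks with a GA tuner and logging the
 results (the trial count is illustrative; real runs use far more):

 .. code-block:: python

     import tvm

     for task in tasks:
         tuner = tvm.autotvm.tuner.GATuner(task)
         tuner.tune(
             n_trial=10,  # illustrative; increase for real tuning
             measure_option=tvm.autotvm.measure_option(builder=builder, runner=runner),
             callbacks=[tvm.autotvm.callback.log_to_file("microtvm_autotune.log.json")],
             si_prefix="M",
         )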
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 171-189
+.. GENERATED FROM PYTHON SOURCE LINES 177-195
 
 .. code-block:: default
 
@@ -264,7 +265,7 @@ Run Autotuning
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 190-196
+.. GENERATED FROM PYTHON SOURCE LINES 196-202
 
 Timing the untuned program
 ###########################
@@ -273,7 +274,7 @@ Timing the untuned program
  the tuned operator.
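
 A sketch of the baseline: build under the same pass context and time each
 operator on the device with the debug executor (``project`` stands for the
 generated microTVM project from the full tutorial):

 .. code-block:: python

     with pass_context:
         lowered = tvm.relay.build(mod, target=TARGET, runtime=RUNTIME, params=params)

     with tvm.micro.Session(project.transport()) as session:
         debug_module = tvm.micro.create_local_debug_executor(
             lowered.get_graph_json(), session.get_system_lib(), session.device
         )
         debug_module.set_input(**lowered.get_params())
         debug_module.run()  # prints a per-operator profile like the one below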
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 196-234
+.. GENERATED FROM PYTHON SOURCE LINES 202-240
 
 .. code-block:: default
 
@@ -328,21 +329,21 @@ Timing the untuned program
     ########## Build without Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  309.8     98.729   (1, 2, 10, 10, 3)  2       1        [309.8]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.032     0.966    (1, 6, 10, 10)     1       1        [3.032]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.957     0.305    (1, 1, 10, 10, 3)  1       1        [0.957]           
-    Total_time                                    -                                             313.789   -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  311.8     98.714   (1, 2, 10, 10, 3)  2       1        [311.8]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       3.087     0.977    (1, 6, 10, 10)     1       1        [3.087]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.976     0.309    (1, 1, 10, 10, 3)  1       1        [0.976]           
+    Total_time                                    -                                             315.863   -        -                  -       -        -                 
 
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 235-238
+.. GENERATED FROM PYTHON SOURCE LINES 241-244
 
 Timing the tuned program
 #########################
  Once autotuning completes, you can time execution of the entire program using the Debug Runtime:
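
 The tuned build differs only in applying the tuning records first; a sketch:

 .. code-block:: python

     with tvm.autotvm.apply_history_best("microtvm_autotune.log.json"):
         with pass_context:
             lowered_tuned = tvm.relay.build(
                 mod, target=TARGET, runtime=RUNTIME, params=params
             )
     # Timing then proceeds exactly as for the untuned build above.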
 
-.. GENERATED FROM PYTHON SOURCE LINES 238-276
+.. GENERATED FROM PYTHON SOURCE LINES 244-282
 
 .. code-block:: default
 
@@ -397,10 +398,10 @@ Timing the tuned program
     ########## Build with Autotuning ##########
     Node Name                                     Ops                                           Time(us)  Time(%)  Shape              Inputs  Outputs  Measurements(us)  
     ---------                                     ---                                           --------  -------  -----              ------  -------  ----------------  
-    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  84.25     96.829   (1, 6, 10, 10, 1)  2       1        [84.25]           
-    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.799     2.068    (1, 6, 10, 10)     1       1        [1.799]           
-    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.96      1.104    (1, 1, 10, 10, 3)  1       1        [0.96]            
-    Total_time                                    -                                             87.009    -        -                  -       -        -                 
+    tvmgen_default_fused_nn_contrib_conv2d_NCHWc  tvmgen_default_fused_nn_contrib_conv2d_NCHWc  247.9     98.841   (1, 1, 10, 10, 6)  2       1        [247.9]           
+    tvmgen_default_fused_layout_transform_1       tvmgen_default_fused_layout_transform_1       1.957     0.78     (1, 6, 10, 10)     1       1        [1.957]           
+    tvmgen_default_fused_layout_transform         tvmgen_default_fused_layout_transform         0.951     0.379    (1, 1, 10, 10, 3)  1       1        [0.951]           
+    Total_time                                    -                                             250.807   -        -                  -       -        -                 
 
 
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt
index cdd34a14d..eba763383 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_ethosu.rst.txt
@@ -39,7 +39,20 @@ It provides a programmer's view that is suitable for software development.
 In this tutorial, we will be compiling a MobileNet v1 model and instructing
 TVM to offload operators to the Ethos(TM)-U55 where possible.
 
-.. GENERATED FROM PYTHON SOURCE LINES 41-69
+.. GENERATED FROM PYTHON SOURCE LINES 39-41
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 47-75
 
 Obtaining TVM
 -------------
@@ -70,7 +83,7 @@ Typing ``tvmc`` on the command line should display the following:
     TVMC - TVM driver command-line interface
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 71-102
+.. GENERATED FROM PYTHON SOURCE LINES 77-108
 
 Installing additional python dependencies
 -----------------------------------------
@@ -104,7 +117,7 @@ These packages can be installed by running the following from the command line:
   pip install -r requirements.txt
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 104-124
+.. GENERATED FROM PYTHON SOURCE LINES 110-130
 
 Obtaining the Model
 -------------------
@@ -127,7 +140,7 @@ For this tutorial we will be using the model in Tflite format.
   tar xvf mobilenet_v1_1.0_224_quant.tar
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 126-152
+.. GENERATED FROM PYTHON SOURCE LINES 132-158
 
 Compiling the model for Arm(R) Cortex(R)-M55 CPU and Ethos(TM)-U55 NPU with CMSIS-NN
 ------------------------------------------------------------------------------------
@@ -156,7 +169,7 @@ on our target device using the TVM runtime.
                --output-format=mlf
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 154-182
+.. GENERATED FROM PYTHON SOURCE LINES 160-188
 
 .. note:: Explanation of tvmc compile arguments:
 
@@ -187,7 +200,7 @@ on our target device using the TVM runtime.
   * ``--output-format=mlf`` : Output should be generated in the Model Library Format.
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 184-191
+.. GENERATED FROM PYTHON SOURCE LINES 190-197
 
 .. note:: If you don't want to make use of the microNPU and want to offload
    operators to CMSIS-NN only:
@@ -197,7 +210,7 @@ on our target device using the TVM runtime.
   * Remove the microNPU config parameter ``--target-ethos-u-accelerator_config=ethos-u55-256``
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 193-200
+.. GENERATED FROM PYTHON SOURCE LINES 199-206
 
 Extracting the generated code into the current directory
 --------------------------------------------------------
@@ -207,7 +220,7 @@ Extracting the generated code into the current directory
   tar xvf module.tar
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 202-216
+.. GENERATED FROM PYTHON SOURCE LINES 208-222
 
 Getting ImageNet labels
 -----------------------
@@ -224,7 +237,7 @@ to include them in our C application later.
   -o ./labels_mobilenet_quant_v1_224.txt
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 218-236
+.. GENERATED FROM PYTHON SOURCE LINES 224-242
 
 Getting the input image
 -----------------------
@@ -245,7 +258,7 @@ in the next step to convert the image into an array of bytes in a C header file.
   curl -sS https://s3.amazonaws.com/model-server/inputs/kitten.jpg -o ./kitten.jpg
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 238-315
+.. GENERATED FROM PYTHON SOURCE LINES 244-321
 
 Pre-processing the image
 ------------------------
@@ -325,7 +338,7 @@ Run the script from the command line:
 
   python convert_image.py ./kitten.jpg
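
 A condensed sketch of what such a ``convert_image.py`` script does; the
 tutorial's full script also handles padding and nicer C formatting:

 .. code-block:: python

     import sys

     import numpy as np
     from PIL import Image

     # Resize to the model's 224x224 input and emit the bytes as a C array.
     img = Image.open(sys.argv[1]).resize((224, 224))
     img_data = np.asarray(img).astype("uint8")
     with open("include/inputs.h", "w") as f:
         f.write("static uint8_t input[] = {")
         f.write(",".join(str(b) for b in img_data.tobytes()))
         f.write("};\n")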
 
-.. GENERATED FROM PYTHON SOURCE LINES 317-362
+.. GENERATED FROM PYTHON SOURCE LINES 323-368
 
 Pre-processing the labels
 -------------------------
@@ -373,7 +386,7 @@ Run the script from the command line:
 
   python convert_labels.py
 
-.. GENERATED FROM PYTHON SOURCE LINES 364-436
+.. GENERATED FROM PYTHON SOURCE LINES 370-442
 
 Writing the demo application
 ----------------------------
@@ -448,14 +461,14 @@ In addition, you will need these header files from github in your ``./include``
 
 `include files <https://github.com/apache/tvm/tree/main/apps/microtvm/ethosu/include>`_
 
-.. GENERATED FROM PYTHON SOURCE LINES 438-442
+.. GENERATED FROM PYTHON SOURCE LINES 444-448
 
 .. note::
 
   If you'd like to use FreeRTOS for task scheduling and queues, a sample application can be found here
  `demo_freertos.c <https://github.com/apache/tvm/blob/main/apps/microtvm/ethosu/src/demo_freertos.c>`_
 
-.. GENERATED FROM PYTHON SOURCE LINES 444-454
+.. GENERATED FROM PYTHON SOURCE LINES 450-460
 
 Creating the linker script
 --------------------------
@@ -468,7 +481,7 @@ placed in your working directory.
 An example linker script for the FVP can be found here
 `corstone300.ld <https://github.com/apache/tvm/blob/main/apps/microtvm/ethosu/corstone300.ld>`_
 
-.. GENERATED FROM PYTHON SOURCE LINES 456-463
+.. GENERATED FROM PYTHON SOURCE LINES 462-469
 
 .. note::
 
@@ -478,7 +491,7 @@ An example linker script for the FVP can be found here
   fit into the limited SRAM available. For this reason it's important that the
   linker script places the ``ethosu_scratch`` section into DRAM (DDR).
 
-.. GENERATED FROM PYTHON SOURCE LINES 465-474
+.. GENERATED FROM PYTHON SOURCE LINES 471-480
 
 .. note::
 
@@ -490,7 +503,7 @@ An example linker script for the FVP can be found here
   ``export PATH=/opt/arm/FVP_Corstone_SSE-300_Ethos-U55/models/Linux64_GCC-6.4:/opt/arm/cmake/bin:$PATH``
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 476-484
+.. GENERATED FROM PYTHON SOURCE LINES 482-490
 
 Building the demo application using make
 ----------------------------------------
@@ -501,7 +514,7 @@ in your working directory before running ``make`` on the command line:
 An example Makefile can be found here:
 `Makefile <https://github.com/apache/tvm/blob/main/apps/microtvm/ethosu/Makefile>`_
 
-.. GENERATED FROM PYTHON SOURCE LINES 486-491
+.. GENERATED FROM PYTHON SOURCE LINES 492-497
 
 .. note::
 
@@ -509,7 +522,7 @@ An example Makefile can be found here:
     ``make FREERTOS_PATH=<FreeRTOS directory>``
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 493-573
+.. GENERATED FROM PYTHON SOURCE LINES 499-579
 
 Running the demo application
 ----------------------------
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt
index 88fe81b0e..28a087209 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_tflite.rst.txt
@@ -27,7 +27,20 @@ microTVM with TFLite Models
 This tutorial is an introduction to working with microTVM and a TFLite
 model with Relay.
 
-.. GENERATED FROM PYTHON SOURCE LINES 29-124
+.. GENERATED FROM PYTHON SOURCE LINES 27-29
+
+.. code-block:: default
+
+
+
+
+
+
+
+
+
+
+.. GENERATED FROM PYTHON SOURCE LINES 35-130
 
 .. note::
     If you want to run this tutorial on the microTVM Reference VM, download the Jupyter
@@ -125,7 +138,7 @@ Load and prepare the Pre-Trained Model
 Load the pretrained TFLite model from a file in your current
 directory into a buffer
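
 A minimal sketch of that load (the filename is a placeholder for whichever
 model you downloaded):

 .. code-block:: python

     import os

     model_path = os.path.join(os.getcwd(), "model.tflite")  # placeholder name
     with open(model_path, "rb") as f:
         tflite_model_buf = f.read()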
 
-.. GENERATED FROM PYTHON SOURCE LINES 124-144
+.. GENERATED FROM PYTHON SOURCE LINES 130-150
 
 .. code-block:: default
 
@@ -156,11 +169,11 @@ directory into a buffer
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 145-146
+.. GENERATED FROM PYTHON SOURCE LINES 151-152
 
 Using the buffer, transform it into a TFLite model Python object
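
 A sketch of that conversion, handling both layouts of the ``tflite`` Python
 package:

 .. code-block:: python

     try:
         import tflite

         tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)
     except AttributeError:
         import tflite.Model

         tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)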
 
-.. GENERATED FROM PYTHON SOURCE LINES 146-155
+.. GENERATED FROM PYTHON SOURCE LINES 152-161
 
 .. code-block:: default
 
@@ -180,11 +193,11 @@ Using the buffer, transform into a tflite model python object
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 156-157
+.. GENERATED FROM PYTHON SOURCE LINES 162-163
 
 Print out the version of the model
 
-.. GENERATED FROM PYTHON SOURCE LINES 157-160
+.. GENERATED FROM PYTHON SOURCE LINES 163-166
 
 .. code-block:: default
 
@@ -204,7 +217,7 @@ Print out the version of the model
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 161-169
+.. GENERATED FROM PYTHON SOURCE LINES 167-175
 
 Parse the Python model object to convert it into a Relay module
 and weights.
@@ -215,7 +228,7 @@ If you are unsure what that might be, this can be discovered by using
 the ``visualize.py`` script within the Tensorflow project.
 See `How do I inspect a .tflite file? <https://www.tensorflow.org/lite/guide/faq>`_
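
 A sketch of the conversion, where the input name, shape, and dtype are
 placeholders for what you would read off your model:

 .. code-block:: python

     from tvm import relay

     input_tensor = "input"   # placeholder: your model's real input name
     input_shape = (1, 100)   # placeholder shape
     input_dtype = "float32"  # placeholder dtype

     mod, params = relay.frontend.from_tflite(
         tflite_model,
         shape_dict={input_tensor: input_shape},
         dtype_dict={input_tensor: input_dtype},
     )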
 
-.. GENERATED FROM PYTHON SOURCE LINES 169-178
+.. GENERATED FROM PYTHON SOURCE LINES 175-184
 
 .. code-block:: default
 
@@ -235,7 +248,7 @@ See `How do I inspect a .tflite file? <https://www.tensorflow.org/lite/guide/faq
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 179-188
+.. GENERATED FROM PYTHON SOURCE LINES 185-194
 
 Defining the target
 -------------------
@@ -247,7 +260,7 @@ TARGET, the C Runtime as the RUNTIME and a proper board/VM to run it (Zephyr wil
 QEMU VM based on BOARD). In the example below the x86 arch is selected and an x86 VM is picked up accordingly:
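
 A sketch of these definitions (the board name is illustrative):

 .. code-block:: python

     import tvm

     RUNTIME = tvm.relay.backend.Runtime("crt", {"system-lib": True})
     TARGET = tvm.target.target.micro("host")
     BOARD = "qemu_x86"  # Zephyr spins up a matching QEMU VM for this board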
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 188-218
+.. GENERATED FROM PYTHON SOURCE LINES 194-224
 
 .. code-block:: default
 
@@ -288,11 +301,11 @@ QEMU VM based on BOARD. In the example below the x86 arch is selected and a x86
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 219-220
+.. GENERATED FROM PYTHON SOURCE LINES 225-226
 
 Now, compile the model for the target:
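
 A sketch of the build call; vectorization is disabled because the C codegen
 used by the C runtime does not support vectorized loops:

 .. code-block:: python

     with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
         module = relay.build(mod, target=TARGET, runtime=RUNTIME, params=params)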
 
-.. GENERATED FROM PYTHON SOURCE LINES 220-303
+.. GENERATED FROM PYTHON SOURCE LINES 226-309
 
 .. code-block:: default
 
@@ -416,14 +429,14 @@ Now, compile the model for the target:
 
 
 
-.. GENERATED FROM PYTHON SOURCE LINES 304-308
+.. GENERATED FROM PYTHON SOURCE LINES 310-314
 
 Next, establish a session with the simulated device and run the
 computation. The `with session` line would typically flash an attached
 microcontroller, but in this tutorial it simply launches a subprocess
 to stand in for one.
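
 A sketch of that session, assuming ``project`` is the generated project from
 the previous step and reusing the input details from the conversion step:

 .. code-block:: python

     import numpy as np

     with tvm.micro.Session(project.transport()) as session:
         graph_mod = tvm.micro.create_local_graph_executor(
             module.get_graph_json(), session.get_system_lib(), session.device
         )
         graph_mod.set_input(**module.get_params())
         graph_mod.set_input(
             input_tensor, tvm.nd.array(np.zeros(input_shape, dtype=input_dtype))
         )
         graph_mod.run()
         out = graph_mod.get_output(0).numpy()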
 
-.. GENERATED FROM PYTHON SOURCE LINES 308-325
+.. GENERATED FROM PYTHON SOURCE LINES 314-331
 
 .. code-block:: default
 
diff --git a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
index 9aea24f02..112387f48 100644
--- a/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
+++ b/docs/_sources/how_to/work_with_microtvm/micro_train.rst.txt
@@ -225,7 +225,7 @@ take about **2 minutes** to download the Stanford Cars, while COCO 2017 validati
  .. code-block:: none
 
 
-    '/tmp/tmpgv6n6zt9/images/random'
+    '/tmp/tmpns2xhuv2/images/random'
 
 
 
@@ -325,8 +325,8 @@ objects to other stuff? We can display some examples from our datasets using ``m
 
  .. code-block:: none
 
-    /tmp/tmpgv6n6zt9/images/target contains 8144 images
-    /tmp/tmpgv6n6zt9/images/random contains 5000 images
+    /tmp/tmpns2xhuv2/images/target contains 8144 images
+    /tmp/tmpns2xhuv2/images/random contains 5000 images
 
 
 
@@ -501,13 +501,13 @@ the time on our validation set).
  .. code-block:: none
 
     Epoch 1/3
-    328/328 - 55s - loss: 0.2048 - accuracy: 0.9284 - val_loss: 0.1254 - val_accuracy: 0.9573
+    328/328 - 55s - loss: 0.2157 - accuracy: 0.9258 - val_loss: 0.1582 - val_accuracy: 0.9532
     Epoch 2/3
-    328/328 - 52s - loss: 0.0951 - accuracy: 0.9654 - val_loss: 0.1188 - val_accuracy: 0.9626
+    328/328 - 52s - loss: 0.1005 - accuracy: 0.9636 - val_loss: 0.1788 - val_accuracy: 0.9535
     Epoch 3/3
... 10721 lines suppressed ...