Posted to commits@tvm.apache.org by lm...@apache.org on 2020/11/13 12:22:29 UTC

[incubator-tvm-site] branch asf-site updated: Docs build at Fri Nov 13 04:22:03 PST 2020

This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new ad1e394  Docs build at Fri Nov 13 04:22:03 PST 2020
ad1e394 is described below

commit ad1e39441c41b176a1c32aa10507be617e88f890
Author: Lianmin Zheng <li...@gmail.com>
AuthorDate: Fri Nov 13 04:22:04 2020 -0800

    Docs build at Fri Nov 13 04:22:03 PST 2020
---
 .../micro_reference_vm.ipynb                       |   43 +
 .../tune_simple_template.py                        |    2 +-
 .../tune_simple_template.ipynb                     |    2 +-
 .../tvmc_command_line_driver.py                    |    6 +-
 .../tune_conv2d_cuda.ipynb                         |    2 +-
 .../tune_relay_cuda.py                             |   13 +-
 .../tune_network_cuda.py                           |  302 ++
 .../tune_relay_mobile_gpu.ipynb                    |    4 +-
 .../tune_conv2d_cuda.py                            |    3 +-
 .../tune_conv2d_layer_cuda.py                      |   14 +-
 .../tune_relay_cuda.ipynb                          |    6 +-
 .../micro_reference_vm.py                          |  139 +
 .../tune_relay_x86.py                              |    6 +-
 .../tune_matmul_x86.py                             |    2 +-
 .../tune_relay_x86.ipynb                           |    4 +-
 .../tune_relay_arm.py                              |    6 +-
 .../tune_conv2d_layer_cuda.ipynb                   |    6 +-
 .../tune_network_cuda.ipynb                        |  147 +
 .../tvmc_command_line_driver.ipynb                 |    2 +-
 .../tune_relay_mobile_gpu.py                       |    6 +-
 .../tune_matmul_x86.ipynb                          |    2 +-
 .../tune_relay_arm.ipynb                           |    4 +-
 docs/_images/sphx_glr_micro_reference_vm_thumb.png |  Bin 0 -> 26786 bytes
 docs/_images/sphx_glr_tune_network_cuda_thumb.png  |  Bin 0 -> 26786 bytes
 docs/_sources/deploy/arm_compute_lib.rst.txt       |    2 +-
 docs/_sources/deploy/index.rst.txt                 |    1 +
 docs/_sources/deploy/vitis_ai.rst.txt              |  652 +++
 .../auto_scheduler/sg_execution_times.rst.txt      |    7 +-
 .../auto_scheduler/tune_conv2d_layer_cuda.rst.txt  | 1262 +----
 .../auto_scheduler/tune_matmul_x86.rst.txt         |   71 +-
 .../auto_scheduler/tune_network_cuda.rst.txt       |  381 ++
 .../tutorials/autotvm/sg_execution_times.rst.txt   |   16 +-
 .../tutorials/autotvm/tune_conv2d_cuda.rst.txt     |   47 +-
 .../tutorials/autotvm/tune_relay_arm.rst.txt       |    6 +-
 .../tutorials/autotvm/tune_relay_cuda.rst.txt      |   13 +-
 .../autotvm/tune_relay_mobile_gpu.rst.txt          |    6 +-
 .../tutorials/autotvm/tune_relay_x86.rst.txt       |    6 +-
 .../tutorials/autotvm/tune_simple_template.rst.txt |   22 +-
 .../tutorials/dev/bring_your_own_datatypes.rst.txt |    2 +-
 .../tutorials/dev/low_level_custom_pass.rst.txt    |   66 +-
 .../tutorials/dev/sg_execution_times.rst.txt       |    8 +-
 docs/_sources/tutorials/dev/use_pass_infra.rst.txt | 3099 +-----------
 .../frontend/deploy_model_on_android.rst.txt       |    2 +-
 .../deploy_object_detection_pytorch.rst.txt        |    2 +-
 .../tutorials/frontend/deploy_prequantized.rst.txt |    2 +-
 .../frontend/deploy_prequantized_tflite.rst.txt    |    4 +-
 .../tutorials/frontend/deploy_ssd_gluoncv.rst.txt  |  126 +-
 docs/_sources/tutorials/frontend/from_onnx.rst.txt |   13 +-
 .../tutorials/frontend/from_pytorch.rst.txt        |    9 +
 .../tutorials/frontend/from_tensorflow.rst.txt     | 1963 +++++++-
 .../tutorials/frontend/sg_execution_times.rst.txt  |   40 +-
 .../tutorials/frontend/using_external_lib.rst.txt  |   20 -
 .../get_started/cross_compilation_and_rpc.rst.txt  |    2 +-
 .../get_started/relay_quick_start.rst.txt          |    2 +-
 .../get_started/sg_execution_times.rst.txt         |   10 +-
 .../get_started/tensor_expr_get_started.rst.txt    |    2 +-
 .../get_started/tvmc_command_line_driver.rst.txt   |    6 +-
 docs/_sources/tutorials/index.rst.txt              |  332 +-
 docs/_sources/tutorials/language/reduction.rst.txt |  155 +-
 docs/_sources/tutorials/language/scan.rst.txt      |   93 +-
 .../tutorials/language/schedule_primitives.rst.txt |  341 +-
 .../tutorials/language/sg_execution_times.rst.txt  |   18 +-
 docs/_sources/tutorials/language/tensorize.rst.txt |  136 +-
 .../tutorials/language/tuple_inputs.rst.txt        |  109 +-
 .../tutorials/micro/micro_reference_vm.rst.txt     |  158 +
 docs/_sources/tutorials/micro/micro_tflite.rst.txt |    9 +
 .../tutorials/micro/sg_execution_times.rst.txt     |    5 +-
 .../tutorials/optimize/opt_conv_cuda.rst.txt       |    2 +-
 .../tutorials/optimize/opt_conv_tensorcore.rst.txt |   68 +-
 docs/_sources/tutorials/optimize/opt_gemm.rst.txt  |  237 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |   10 +-
 docs/_sources/tutorials/topi/intro_topi.rst.txt    |  359 +-
 .../tutorials/topi/sg_execution_times.rst.txt      |    4 +-
 .../tutorials/autotvm/sg_execution_times.rst.txt   |    4 +-
 .../vta/tutorials/autotvm/tune_relay_vta.rst.txt   |    2 +-
 .../frontend/deploy_classification.rst.txt         |    4 +-
 .../tutorials/frontend/sg_execution_times.rst.txt  |    4 +-
 docs/_sources/vta/tutorials/index.rst.txt          |   32 +-
 .../_sources/vta/tutorials/matrix_multiply.rst.txt |   97 +-
 .../vta/tutorials/optimize/convolution_opt.rst.txt |  101 +-
 .../tutorials/optimize/matrix_multiply_opt.rst.txt |   93 +-
 .../tutorials/optimize/sg_execution_times.rst.txt  |    6 +-
 .../vta/tutorials/sg_execution_times.rst.txt       |    6 +-
 .../_sources/vta/tutorials/vta_get_started.rst.txt |   62 +-
 docs/api/doxygen/algorithm_8h_source.html          |    2 +-
 docs/api/doxygen/analyzer_8h_source.html           |    4 +-
 docs/api/doxygen/annotated.html                    |  101 +-
 docs/api/doxygen/annotation_8h_source.html         |    2 +-
 docs/api/doxygen/auto__schedule_8h.html            |    2 +-
 docs/api/doxygen/auto__schedule_8h__incl.svg       |  945 ++--
 docs/api/doxygen/auto__schedule_8h_source.html     |    4 +-
 docs/api/doxygen/auto__scheduler_2feature_8h.html  |    2 +-
 .../doxygen/auto__scheduler_2feature_8h__incl.svg  |  899 ++--
 docs/api/doxygen/autodiff_8h_source.html           |    2 +-
 docs/api/doxygen/base_8h_source.html               |    2 +-
 docs/api/doxygen/bitserial_8h_source.html          |    4 +-
 docs/api/doxygen/bound_8h_source.html              |    2 +-
 docs/api/doxygen/broadcast_8h.html                 |    4 +-
 docs/api/doxygen/buffer_8h.html                    |    6 +-
 docs/api/doxygen/buffer_8h_source.html             |   41 +-
 docs/api/doxygen/bytecode_8h_source.html           |    2 +-
 docs/api/doxygen/c__runtime__api_8h_source.html    |    2 +-
 docs/api/doxygen/classes.html                      |  339 +-
 docs/api/doxygen/classtvm_1_1BaseAttrsNode.html    |   10 +-
 .../doxygen/classtvm_1_1BaseExprNode-members.html  |    3 +-
 docs/api/doxygen/classtvm_1_1BaseExprNode.html     |   31 +-
 .../classtvm_1_1BaseExprNode__coll__graph.svg      |   59 +-
 .../classtvm_1_1BaseExprNode__inherit__graph.svg   |  234 +-
 .../doxygen/classtvm_1_1BaseFuncNode-members.html  |    2 +-
 docs/api/doxygen/classtvm_1_1BaseFuncNode.html     |    9 +-
 .../classtvm_1_1BaseFuncNode__coll__graph.svg      |  208 +-
 .../classtvm_1_1BaseFuncNode__inherit__graph.svg   |   38 +-
 docs/api/doxygen/classtvm_1_1Bool-members.html     |    4 +-
 docs/api/doxygen/classtvm_1_1Bool.html             |   24 +-
 .../classtvm_1_1ConstructorNode-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1ConstructorNode.html  |    9 +-
 .../classtvm_1_1ConstructorNode__coll__graph.svg   |  312 +-
 ...classtvm_1_1ConstructorNode__inherit__graph.svg |   38 +-
 docs/api/doxygen/classtvm_1_1FloatImm-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1FloatImm.html         |   17 +-
 .../doxygen/classtvm_1_1FloatImmNode-members.html  |   11 +-
 docs/api/doxygen/classtvm_1_1FloatImmNode.html     |    8 +-
 .../classtvm_1_1FloatImmNode__coll__graph.svg      |   65 +-
 .../classtvm_1_1FloatImmNode__inherit__graph.svg   |   35 +-
 .../doxygen/classtvm_1_1GlobalVarNode-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1GlobalVarNode.html    |    9 +-
 .../classtvm_1_1GlobalVarNode__coll__graph.svg     |  286 +-
 .../classtvm_1_1GlobalVarNode__inherit__graph.svg  |   38 +-
 docs/api/doxygen/classtvm_1_1IntImm-members.html   |    2 +-
 docs/api/doxygen/classtvm_1_1IntImm.html           |   17 +-
 .../doxygen/classtvm_1_1IntImmNode-members.html    |   11 +-
 docs/api/doxygen/classtvm_1_1IntImmNode.html       |    8 +-
 .../classtvm_1_1IntImmNode__coll__graph.svg        |   65 +-
 .../classtvm_1_1IntImmNode__inherit__graph.svg     |   35 +-
 docs/api/doxygen/classtvm_1_1Integer-members.html  |    4 +-
 docs/api/doxygen/classtvm_1_1Integer.html          |   26 +-
 docs/api/doxygen/classtvm_1_1OpNode-members.html   |    2 +-
 docs/api/doxygen/classtvm_1_1OpNode.html           |    9 +-
 .../doxygen/classtvm_1_1OpNode__coll__graph.svg    |  320 +-
 .../doxygen/classtvm_1_1OpNode__inherit__graph.svg |   38 +-
 .../doxygen/classtvm_1_1PrimExprNode-members.html  |    5 +-
 docs/api/doxygen/classtvm_1_1PrimExprNode.html     |    8 +-
 .../classtvm_1_1PrimExprNode__coll__graph.svg      |   65 +-
 .../classtvm_1_1PrimExprNode__inherit__graph.svg   |   35 +-
 .../doxygen/classtvm_1_1RelayExprNode-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1RelayExprNode.html    |   31 +-
 .../classtvm_1_1RelayExprNode__coll__graph.svg     |  136 +-
 .../classtvm_1_1RelayExprNode__inherit__graph.svg  |   98 +-
 ...asstvm_1_1arith_1_1IterMapExprNode-members.html |    9 +-
 .../classtvm_1_1arith_1_1IterMapExprNode.html      |    8 +-
 ...vm_1_1arith_1_1IterMapExprNode__coll__graph.svg |   65 +-
 ...1_1arith_1_1IterMapExprNode__inherit__graph.svg |   35 +-
 ...stvm_1_1arith_1_1IterSplitExprNode-members.html |   11 +-
 .../classtvm_1_1arith_1_1IterSplitExprNode.html    |    8 +-
 ..._1_1arith_1_1IterSplitExprNode__coll__graph.svg |  320 +-
 ...1arith_1_1IterSplitExprNode__inherit__graph.svg |   35 +-
 ...asstvm_1_1arith_1_1IterSumExprNode-members.html |   11 +-
 .../classtvm_1_1arith_1_1IterSumExprNode.html      |    8 +-
 ...vm_1_1arith_1_1IterSumExprNode__coll__graph.svg |  298 +-
 ...1_1arith_1_1IterSumExprNode__inherit__graph.svg |   35 +-
 ...__scheduler_1_1ProgramMeasurerNode-members.html |   15 +-
 ..._1_1auto__scheduler_1_1ProgramMeasurerNode.html |   21 +-
 ...heduler_1_1ProgramMeasurerNode__coll__graph.svg |  127 +-
 ...uler_1_1ProgramMeasurerNode__inherit__graph.svg |   43 +-
 .../classtvm_1_1relay_1_1CallNode-members.html     |    2 +-
 .../api/doxygen/classtvm_1_1relay_1_1CallNode.html |    9 +-
 .../classtvm_1_1relay_1_1CallNode__coll__graph.svg |  228 +-
 ...asstvm_1_1relay_1_1CallNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1ConstantNode-members.html |    2 +-
 .../doxygen/classtvm_1_1relay_1_1ConstantNode.html |    9 +-
 ...sstvm_1_1relay_1_1ConstantNode__coll__graph.svg |  300 +-
 ...vm_1_1relay_1_1ConstantNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1FunctionNode-members.html |    2 +-
 .../doxygen/classtvm_1_1relay_1_1FunctionNode.html |    9 +-
 ...sstvm_1_1relay_1_1FunctionNode__coll__graph.svg |  284 +-
 ...vm_1_1relay_1_1FunctionNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1IfNode-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1relay_1_1IfNode.html  |    9 +-
 .../classtvm_1_1relay_1_1IfNode__coll__graph.svg   |  204 +-
 ...classtvm_1_1relay_1_1IfNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1LetNode-members.html      |    2 +-
 docs/api/doxygen/classtvm_1_1relay_1_1LetNode.html |    9 +-
 .../classtvm_1_1relay_1_1LetNode__coll__graph.svg  |  248 +-
 ...lasstvm_1_1relay_1_1LetNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1MatchNode-members.html    |    2 +-
 .../doxygen/classtvm_1_1relay_1_1MatchNode.html    |    9 +-
 ...classtvm_1_1relay_1_1MatchNode__coll__graph.svg |  204 +-
 ...sstvm_1_1relay_1_1MatchNode__inherit__graph.svg |   38 +-
 ...classtvm_1_1relay_1_1RefCreateNode-members.html |    2 +-
 .../classtvm_1_1relay_1_1RefCreateNode.html        |    9 +-
 ...stvm_1_1relay_1_1RefCreateNode__coll__graph.svg |  200 +-
 ...m_1_1relay_1_1RefCreateNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1RefReadNode-members.html  |    2 +-
 .../doxygen/classtvm_1_1relay_1_1RefReadNode.html  |    9 +-
 ...asstvm_1_1relay_1_1RefReadNode__coll__graph.svg |  200 +-
 ...tvm_1_1relay_1_1RefReadNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1RefWriteNode-members.html |    2 +-
 .../doxygen/classtvm_1_1relay_1_1RefWriteNode.html |    9 +-
 ...sstvm_1_1relay_1_1RefWriteNode__coll__graph.svg |  202 +-
 ...vm_1_1relay_1_1RefWriteNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1TempExprNode-members.html |    2 +-
 .../doxygen/classtvm_1_1relay_1_1TempExprNode.html |    9 +-
 ...sstvm_1_1relay_1_1TempExprNode__coll__graph.svg |  166 +-
 ...vm_1_1relay_1_1TempExprNode__inherit__graph.svg |   38 +-
 ...sstvm_1_1relay_1_1TupleGetItemNode-members.html |    2 +-
 .../classtvm_1_1relay_1_1TupleGetItemNode.html     |    9 +-
 ...m_1_1relay_1_1TupleGetItemNode__coll__graph.svg |  202 +-
 ..._1relay_1_1TupleGetItemNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1TupleNode-members.html    |    2 +-
 .../doxygen/classtvm_1_1relay_1_1TupleNode.html    |    9 +-
 ...classtvm_1_1relay_1_1TupleNode__coll__graph.svg |  160 +-
 ...sstvm_1_1relay_1_1TupleNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1relay_1_1VarNode-members.html      |    2 +-
 docs/api/doxygen/classtvm_1_1relay_1_1VarNode.html |    9 +-
 .../classtvm_1_1relay_1_1VarNode__coll__graph.svg  |  192 +-
 ...lasstvm_1_1relay_1_1VarNode__inherit__graph.svg |   38 +-
 .../doxygen/classtvm_1_1tir_1_1Add-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Add.html       |   14 +-
 .../classtvm_1_1tir_1_1AddNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1AddNode.html   |    8 +-
 .../classtvm_1_1tir_1_1AddNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1AddNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1Allocate-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Allocate.html  |   14 +-
 .../classtvm_1_1tir_1_1AllocateNode-members.html   |    9 +-
 .../doxygen/classtvm_1_1tir_1_1AllocateNode.html   |   12 +-
 ...lasstvm_1_1tir_1_1AllocateNode__coll__graph.svg |  342 +-
 ...stvm_1_1tir_1_1AllocateNode__inherit__graph.svg |   49 +-
 .../doxygen/classtvm_1_1tir_1_1And-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1And.html       |   14 +-
 .../classtvm_1_1tir_1_1AndNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1AndNode.html   |    8 +-
 .../classtvm_1_1tir_1_1AndNode__coll__graph.svg    |  250 +-
 .../classtvm_1_1tir_1_1AndNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Any-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Any.html       |    9 +-
 .../classtvm_1_1tir_1_1AnyNode-members.html        |   11 +-
 docs/api/doxygen/classtvm_1_1tir_1_1AnyNode.html   |    8 +-
 .../classtvm_1_1tir_1_1AnyNode__coll__graph.svg    |   65 +-
 .../classtvm_1_1tir_1_1AnyNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1AssertStmt-members.html     |    2 +-
 .../api/doxygen/classtvm_1_1tir_1_1AssertStmt.html |   14 +-
 .../classtvm_1_1tir_1_1AssertStmtNode-members.html |    9 +-
 .../doxygen/classtvm_1_1tir_1_1AssertStmtNode.html |   12 +-
 ...sstvm_1_1tir_1_1AssertStmtNode__coll__graph.svg |  216 +-
 ...vm_1_1tir_1_1AssertStmtNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1AttrStmt-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1AttrStmt.html  |   14 +-
 .../classtvm_1_1tir_1_1AttrStmtNode-members.html   |   11 +-
 .../doxygen/classtvm_1_1tir_1_1AttrStmtNode.html   |   12 +-
 ...lasstvm_1_1tir_1_1AttrStmtNode__coll__graph.svg |  374 +-
 ...stvm_1_1tir_1_1AttrStmtNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1BinaryOpNode-members.html   |    9 +-
 .../doxygen/classtvm_1_1tir_1_1BinaryOpNode.html   |    8 +-
 ...lasstvm_1_1tir_1_1BinaryOpNode__coll__graph.svg |  250 +-
 ...stvm_1_1tir_1_1BinaryOpNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1Broadcast-members.html      |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Broadcast.html |   14 +-
 .../classtvm_1_1tir_1_1BroadcastNode-members.html  |   11 +-
 .../doxygen/classtvm_1_1tir_1_1BroadcastNode.html  |    8 +-
 ...asstvm_1_1tir_1_1BroadcastNode__coll__graph.svg |  250 +-
 ...tvm_1_1tir_1_1BroadcastNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Buffer-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Buffer.html    |   14 +-
 .../classtvm_1_1tir_1_1BufferLoad-members.html     |    2 +-
 .../api/doxygen/classtvm_1_1tir_1_1BufferLoad.html |   14 +-
 .../classtvm_1_1tir_1_1BufferLoadNode-members.html |    9 +-
 .../doxygen/classtvm_1_1tir_1_1BufferLoadNode.html |    8 +-
 ...sstvm_1_1tir_1_1BufferLoadNode__coll__graph.svg |  254 +-
 ...vm_1_1tir_1_1BufferLoadNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1BufferNode-members.html     |    7 +-
 .../api/doxygen/classtvm_1_1tir_1_1BufferNode.html |   29 +-
 .../classtvm_1_1tir_1_1BufferNode__coll__graph.svg |  420 +-
 ...asstvm_1_1tir_1_1BufferNode__inherit__graph.svg |   51 +-
 .../classtvm_1_1tir_1_1BufferRealize-members.html  |    2 +-
 .../doxygen/classtvm_1_1tir_1_1BufferRealize.html  |   14 +-
 ...asstvm_1_1tir_1_1BufferRealizeNode-members.html |   11 +-
 .../classtvm_1_1tir_1_1BufferRealizeNode.html      |   26 +-
 ...vm_1_1tir_1_1BufferRealizeNode__coll__graph.svg |  298 +-
 ...1_1tir_1_1BufferRealizeNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1BufferStore-members.html    |    2 +-
 .../doxygen/classtvm_1_1tir_1_1BufferStore.html    |   14 +-
 ...classtvm_1_1tir_1_1BufferStoreNode-members.html |   11 +-
 .../classtvm_1_1tir_1_1BufferStoreNode.html        |   12 +-
 ...stvm_1_1tir_1_1BufferStoreNode__coll__graph.svg |  256 +-
 ...m_1_1tir_1_1BufferStoreNode__inherit__graph.svg |   49 +-
 .../doxygen/classtvm_1_1tir_1_1Call-members.html   |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Call.html      |   14 +-
 .../classtvm_1_1tir_1_1CallNode-members.html       |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1CallNode.html  |    8 +-
 .../classtvm_1_1tir_1_1CallNode__coll__graph.svg   |  270 +-
 ...classtvm_1_1tir_1_1CallNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Cast-members.html   |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Cast.html      |   14 +-
 .../classtvm_1_1tir_1_1CastNode-members.html       |   11 +-
 docs/api/doxygen/classtvm_1_1tir_1_1CastNode.html  |    8 +-
 .../classtvm_1_1tir_1_1CastNode__coll__graph.svg   |  248 +-
 ...classtvm_1_1tir_1_1CastNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1CmpOpNode-members.html      |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1CmpOpNode.html |    8 +-
 .../classtvm_1_1tir_1_1CmpOpNode__coll__graph.svg  |  250 +-
 ...lasstvm_1_1tir_1_1CmpOpNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1CommReducer-members.html    |    2 +-
 .../doxygen/classtvm_1_1tir_1_1CommReducer.html    |   14 +-
 ...classtvm_1_1tir_1_1CommReducerNode-members.html |    5 +-
 .../classtvm_1_1tir_1_1CommReducerNode.html        |   29 +-
 ...stvm_1_1tir_1_1CommReducerNode__coll__graph.svg |  127 +-
 ...m_1_1tir_1_1CommReducerNode__inherit__graph.svg |   39 +-
 .../doxygen/classtvm_1_1tir_1_1Div-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Div.html       |   14 +-
 .../classtvm_1_1tir_1_1DivNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1DivNode.html   |    8 +-
 .../classtvm_1_1tir_1_1DivNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1DivNode__inherit__graph.svg |   35 +-
 .../api/doxygen/classtvm_1_1tir_1_1EQ-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1EQ.html        |   14 +-
 .../doxygen/classtvm_1_1tir_1_1EQNode-members.html |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1EQNode.html    |    8 +-
 .../classtvm_1_1tir_1_1EQNode__coll__graph.svg     |  268 +-
 .../classtvm_1_1tir_1_1EQNode__inherit__graph.svg  |   35 +-
 .../classtvm_1_1tir_1_1Evaluate-members.html       |    4 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Evaluate.html  |   36 +-
 .../classtvm_1_1tir_1_1EvaluateNode-members.html   |   11 +-
 .../doxygen/classtvm_1_1tir_1_1EvaluateNode.html   |   12 +-
 ...lasstvm_1_1tir_1_1EvaluateNode__coll__graph.svg |  174 +-
 ...stvm_1_1tir_1_1EvaluateNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1FloorDiv-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1FloorDiv.html  |   14 +-
 .../classtvm_1_1tir_1_1FloorDivNode-members.html   |    9 +-
 .../doxygen/classtvm_1_1tir_1_1FloorDivNode.html   |    8 +-
 ...lasstvm_1_1tir_1_1FloorDivNode__coll__graph.svg |  268 +-
 ...stvm_1_1tir_1_1FloorDivNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1FloorMod-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1FloorMod.html  |   14 +-
 .../classtvm_1_1tir_1_1FloorModNode-members.html   |    9 +-
 .../doxygen/classtvm_1_1tir_1_1FloorModNode.html   |    8 +-
 ...lasstvm_1_1tir_1_1FloorModNode__coll__graph.svg |  268 +-
 ...stvm_1_1tir_1_1FloorModNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1For-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1For.html       |   14 +-
 .../classtvm_1_1tir_1_1ForNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1ForNode.html   |   12 +-
 .../classtvm_1_1tir_1_1ForNode__coll__graph.svg    |  262 +-
 .../classtvm_1_1tir_1_1ForNode__inherit__graph.svg |   49 +-
 .../api/doxygen/classtvm_1_1tir_1_1GE-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1GE.html        |   14 +-
 .../doxygen/classtvm_1_1tir_1_1GENode-members.html |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1GENode.html    |    8 +-
 .../classtvm_1_1tir_1_1GENode__coll__graph.svg     |  268 +-
 .../classtvm_1_1tir_1_1GENode__inherit__graph.svg  |   35 +-
 .../api/doxygen/classtvm_1_1tir_1_1GT-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1GT.html        |   14 +-
 .../doxygen/classtvm_1_1tir_1_1GTNode-members.html |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1GTNode.html    |    8 +-
 .../classtvm_1_1tir_1_1GTNode__coll__graph.svg     |  268 +-
 .../classtvm_1_1tir_1_1GTNode__inherit__graph.svg  |   35 +-
 .../classtvm_1_1tir_1_1IfThenElse-members.html     |    2 +-
 .../api/doxygen/classtvm_1_1tir_1_1IfThenElse.html |   14 +-
 .../classtvm_1_1tir_1_1IfThenElseNode-members.html |   11 +-
 .../doxygen/classtvm_1_1tir_1_1IfThenElseNode.html |   12 +-
 ...sstvm_1_1tir_1_1IfThenElseNode__coll__graph.svg |  216 +-
 ...vm_1_1tir_1_1IfThenElseNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1IterVar-members.html        |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1IterVar.html   |   14 +-
 .../classtvm_1_1tir_1_1IterVarNode-members.html    |    9 +-
 .../doxygen/classtvm_1_1tir_1_1IterVarNode.html    |   29 +-
 ...classtvm_1_1tir_1_1IterVarNode__coll__graph.svg |  356 +-
 ...sstvm_1_1tir_1_1IterVarNode__inherit__graph.svg |   39 +-
 .../api/doxygen/classtvm_1_1tir_1_1LE-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1LE.html        |   14 +-
 .../api/doxygen/classtvm_1_1tir_1_1LT-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1LT.html        |   14 +-
 .../doxygen/classtvm_1_1tir_1_1LTNode-members.html |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1LTNode.html    |    8 +-
 .../classtvm_1_1tir_1_1LTNode__coll__graph.svg     |  268 +-
 .../classtvm_1_1tir_1_1LTNode__inherit__graph.svg  |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Let-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Let.html       |   14 +-
 .../classtvm_1_1tir_1_1LetNode-members.html        |   13 +-
 docs/api/doxygen/classtvm_1_1tir_1_1LetNode.html   |    8 +-
 .../classtvm_1_1tir_1_1LetNode__coll__graph.svg    |  244 +-
 .../classtvm_1_1tir_1_1LetNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1LetStmt-members.html        |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1LetStmt.html   |   14 +-
 .../classtvm_1_1tir_1_1LetStmtNode-members.html    |   13 +-
 .../doxygen/classtvm_1_1tir_1_1LetStmtNode.html    |   12 +-
 ...classtvm_1_1tir_1_1LetStmtNode__coll__graph.svg |  256 +-
 ...sstvm_1_1tir_1_1LetStmtNode__inherit__graph.svg |   49 +-
 .../doxygen/classtvm_1_1tir_1_1Load-members.html   |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Load.html      |   14 +-
 .../classtvm_1_1tir_1_1LoadNode-members.html       |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1LoadNode.html  |    8 +-
 .../classtvm_1_1tir_1_1LoadNode__coll__graph.svg   |  244 +-
 ...classtvm_1_1tir_1_1LoadNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Max-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Max.html       |   14 +-
 .../classtvm_1_1tir_1_1MaxNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1MaxNode.html   |    8 +-
 .../classtvm_1_1tir_1_1MaxNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1MaxNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Min-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Min.html       |   14 +-
 .../classtvm_1_1tir_1_1MinNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1MinNode.html   |    8 +-
 .../classtvm_1_1tir_1_1MinNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1MinNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Mod-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Mod.html       |   14 +-
 .../classtvm_1_1tir_1_1ModNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1ModNode.html   |    8 +-
 .../classtvm_1_1tir_1_1ModNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1ModNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Mul-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Mul.html       |   14 +-
 .../classtvm_1_1tir_1_1MulNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1MulNode.html   |    8 +-
 .../classtvm_1_1tir_1_1MulNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1MulNode__inherit__graph.svg |   35 +-
 .../api/doxygen/classtvm_1_1tir_1_1NE-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1NE.html        |   14 +-
 .../doxygen/classtvm_1_1tir_1_1NENode-members.html |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1NENode.html    |    8 +-
 .../classtvm_1_1tir_1_1NENode__coll__graph.svg     |  268 +-
 .../classtvm_1_1tir_1_1NENode__inherit__graph.svg  |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Not-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Not.html       |   18 +-
 .../classtvm_1_1tir_1_1NotNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1NotNode.html   |    8 +-
 .../classtvm_1_1tir_1_1NotNode__coll__graph.svg    |  248 +-
 .../classtvm_1_1tir_1_1NotNode__inherit__graph.svg |   35 +-
 .../api/doxygen/classtvm_1_1tir_1_1Or-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Or.html        |   14 +-
 .../doxygen/classtvm_1_1tir_1_1OrNode-members.html |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1OrNode.html    |    8 +-
 .../classtvm_1_1tir_1_1OrNode__coll__graph.svg     |  250 +-
 .../classtvm_1_1tir_1_1OrNode__inherit__graph.svg  |   35 +-
 .../classtvm_1_1tir_1_1Prefetch-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Prefetch.html  |   14 +-
 .../classtvm_1_1tir_1_1PrefetchNode-members.html   |   11 +-
 .../doxygen/classtvm_1_1tir_1_1PrefetchNode.html   |   26 +-
 ...lasstvm_1_1tir_1_1PrefetchNode__coll__graph.svg |  184 +-
 ...stvm_1_1tir_1_1PrefetchNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1PrimFunc-members.html       |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1PrimFunc.html  |   17 +-
 .../classtvm_1_1tir_1_1PrimFuncNode-members.html   |    2 +-
 .../doxygen/classtvm_1_1tir_1_1PrimFuncNode.html   |    9 +-
 ...lasstvm_1_1tir_1_1PrimFuncNode__coll__graph.svg |  328 +-
 ...stvm_1_1tir_1_1PrimFuncNode__inherit__graph.svg |   38 +-
 .../classtvm_1_1tir_1_1ProducerLoad-members.html   |    2 +-
 .../doxygen/classtvm_1_1tir_1_1ProducerLoad.html   |   14 +-
 ...lasstvm_1_1tir_1_1ProducerLoadNode-members.html |    9 +-
 .../classtvm_1_1tir_1_1ProducerLoadNode.html       |    8 +-
 ...tvm_1_1tir_1_1ProducerLoadNode__coll__graph.svg |  240 +-
 ..._1_1tir_1_1ProducerLoadNode__inherit__graph.svg |   35 +-
 ...classtvm_1_1tir_1_1ProducerRealize-members.html |    2 +-
 .../classtvm_1_1tir_1_1ProducerRealize.html        |   14 +-
 ...stvm_1_1tir_1_1ProducerRealizeNode-members.html |    9 +-
 .../classtvm_1_1tir_1_1ProducerRealizeNode.html    |   12 +-
 ..._1_1tir_1_1ProducerRealizeNode__coll__graph.svg |  232 +-
 ...1tir_1_1ProducerRealizeNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1ProducerStore-members.html  |    2 +-
 .../doxygen/classtvm_1_1tir_1_1ProducerStore.html  |   14 +-
 ...asstvm_1_1tir_1_1ProducerStoreNode-members.html |   11 +-
 .../classtvm_1_1tir_1_1ProducerStoreNode.html      |   12 +-
 ...vm_1_1tir_1_1ProducerStoreNode__coll__graph.svg |  206 +-
 ...1_1tir_1_1ProducerStoreNode__inherit__graph.svg |   49 +-
 .../doxygen/classtvm_1_1tir_1_1Ramp-members.html   |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Ramp.html      |   14 +-
 .../classtvm_1_1tir_1_1RampNode-members.html       |   11 +-
 docs/api/doxygen/classtvm_1_1tir_1_1RampNode.html  |    8 +-
 .../classtvm_1_1tir_1_1RampNode__coll__graph.svg   |  252 +-
 ...classtvm_1_1tir_1_1RampNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Reduce-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Reduce.html    |   14 +-
 .../classtvm_1_1tir_1_1ReduceNode-members.html     |   11 +-
 .../api/doxygen/classtvm_1_1tir_1_1ReduceNode.html |    8 +-
 .../classtvm_1_1tir_1_1ReduceNode__coll__graph.svg |  340 +-
 ...asstvm_1_1tir_1_1ReduceNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Select-members.html |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Select.html    |   14 +-
 .../classtvm_1_1tir_1_1SelectNode-members.html     |   11 +-
 .../api/doxygen/classtvm_1_1tir_1_1SelectNode.html |    8 +-
 .../classtvm_1_1tir_1_1SelectNode__coll__graph.svg |  252 +-
 ...asstvm_1_1tir_1_1SelectNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1SeqStmt-members.html        |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1SeqStmt.html   |   21 +-
 .../classtvm_1_1tir_1_1SeqStmtNode-members.html    |    9 +-
 .../doxygen/classtvm_1_1tir_1_1SeqStmtNode.html    |   12 +-
 ...classtvm_1_1tir_1_1SeqStmtNode__coll__graph.svg |  105 +-
 ...sstvm_1_1tir_1_1SeqStmtNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1Shuffle-members.html        |    6 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Shuffle.html   |   46 +-
 .../classtvm_1_1tir_1_1ShuffleNode-members.html    |   11 +-
 .../doxygen/classtvm_1_1tir_1_1ShuffleNode.html    |    8 +-
 ...classtvm_1_1tir_1_1ShuffleNode__coll__graph.svg |   73 +-
 ...sstvm_1_1tir_1_1ShuffleNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1SizeVar-members.html        |    6 +-
 docs/api/doxygen/classtvm_1_1tir_1_1SizeVar.html   |   29 +-
 .../classtvm_1_1tir_1_1SizeVarNode-members.html    |   13 +-
 .../doxygen/classtvm_1_1tir_1_1SizeVarNode.html    |    8 +-
 ...classtvm_1_1tir_1_1SizeVarNode__coll__graph.svg |  382 +-
 ...sstvm_1_1tir_1_1SizeVarNode__inherit__graph.svg |   35 +-
 .../classtvm_1_1tir_1_1StmtNode-members.html       |    5 +-
 docs/api/doxygen/classtvm_1_1tir_1_1StmtNode.html  |   87 +-
 .../classtvm_1_1tir_1_1StmtNode__coll__graph.svg   |   79 +-
 ...classtvm_1_1tir_1_1StmtNode__inherit__graph.svg |  101 +-
 .../doxygen/classtvm_1_1tir_1_1Store-members.html  |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Store.html     |   14 +-
 .../classtvm_1_1tir_1_1StoreNode-members.html      |   11 +-
 docs/api/doxygen/classtvm_1_1tir_1_1StoreNode.html |   12 +-
 .../classtvm_1_1tir_1_1StoreNode__coll__graph.svg  |  216 +-
 ...lasstvm_1_1tir_1_1StoreNode__inherit__graph.svg |   49 +-
 .../classtvm_1_1tir_1_1StringImm-members.html      |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1StringImm.html |   18 +-
 .../classtvm_1_1tir_1_1StringImmNode-members.html  |   11 +-
 .../doxygen/classtvm_1_1tir_1_1StringImmNode.html  |    8 +-
 ...asstvm_1_1tir_1_1StringImmNode__coll__graph.svg |   95 +-
 ...tvm_1_1tir_1_1StringImmNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Sub-members.html    |    2 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Sub.html       |   14 +-
 .../classtvm_1_1tir_1_1SubNode-members.html        |    9 +-
 docs/api/doxygen/classtvm_1_1tir_1_1SubNode.html   |    8 +-
 .../classtvm_1_1tir_1_1SubNode__coll__graph.svg    |  268 +-
 .../classtvm_1_1tir_1_1SubNode__inherit__graph.svg |   35 +-
 .../doxygen/classtvm_1_1tir_1_1Var-members.html    |    4 +-
 docs/api/doxygen/classtvm_1_1tir_1_1Var.html       |   34 +-
 .../classtvm_1_1tir_1_1VarNode-members.html        |   11 +-
 docs/api/doxygen/classtvm_1_1tir_1_1VarNode.html   |    8 +-
 .../classtvm_1_1tir_1_1VarNode__coll__graph.svg    |  366 +-
 .../classtvm_1_1tir_1_1VarNode__inherit__graph.svg |   35 +-
 docs/api/doxygen/constant__utils_8h_source.html    |    2 +-
 docs/api/doxygen/cost__model_8h.html               |    2 +-
 docs/api/doxygen/cost__model_8h__incl.svg          |  911 ++--
 docs/api/doxygen/cuda_2dense_8h_source.html        |    4 +-
 docs/api/doxygen/cuda_2injective_8h_source.html    |    6 +-
 .../api/doxygen/cuda_2normalization_8h_source.html |    4 +-
 docs/api/doxygen/cuda_2pooling_8h_source.html      |    6 +-
 docs/api/doxygen/cuda_2reduction_8h_source.html    |    6 +-
 docs/api/doxygen/cuda_2softmax_8h_source.html      |    4 +-
 docs/api/doxygen/data__layout_8h_source.html       |    4 +-
 docs/api/doxygen/dataflow__matcher_8h_source.html  |    2 +-
 docs/api/doxygen/dataflow__pattern_8h_source.html  |    2 +-
 docs/api/doxygen/debug_8h_source.html              |    2 +-
 docs/api/doxygen/detail_2broadcast_8h_source.html  |    4 +-
 docs/api/doxygen/detail_2extern_8h_source.html     |    6 +-
 docs/api/doxygen/device__copy_8h_source.html       |    2 +-
 docs/api/doxygen/dilate_8h_source.html             |    2 +-
 docs/api/doxygen/elemwise_8h_source.html           |    4 +-
 docs/api/doxygen/error_8h_source.html              |    2 +-
 docs/api/doxygen/flatten_8h_source.html            |    2 +-
 docs/api/doxygen/functions_a.html                  |   12 +-
 docs/api/doxygen/functions_b.html                  |   12 +-
 docs/api/doxygen/functions_c.html                  |    8 +-
 docs/api/doxygen/functions_d.html                  |    2 +-
 docs/api/doxygen/functions_e.html                  |    6 +-
 docs/api/doxygen/functions_f.html                  |    8 +-
 docs/api/doxygen/functions_func_a.html             |   12 +-
 docs/api/doxygen/functions_func_b.html             |   14 +-
 docs/api/doxygen/functions_func_c.html             |    8 +-
 docs/api/doxygen/functions_func_d.html             |    2 +-
 docs/api/doxygen/functions_func_e.html             |    6 +-
 docs/api/doxygen/functions_func_f.html             |    8 +-
 docs/api/doxygen/functions_func_g.html             |    4 +-
 docs/api/doxygen/functions_func_i.html             |    6 +-
 docs/api/doxygen/functions_func_l.html             |   10 +-
 docs/api/doxygen/functions_func_m.html             |    8 +-
 docs/api/doxygen/functions_func_n.html             |    4 +-
 docs/api/doxygen/functions_func_o.html             |    2 +-
 docs/api/doxygen/functions_func_p.html             |   10 +-
 docs/api/doxygen/functions_func_r.html             |    4 +-
 docs/api/doxygen/functions_func_s.html             |   23 +-
 docs/api/doxygen/functions_func_t.html             |   11 +-
 docs/api/doxygen/functions_func_v.html             |    2 +-
 docs/api/doxygen/functions_g.html                  |    4 +-
 docs/api/doxygen/functions_h.html                  |    3 +
 docs/api/doxygen/functions_i.html                  |    6 +-
 docs/api/doxygen/functions_l.html                  |   20 +-
 docs/api/doxygen/functions_m.html                  |    8 +-
 docs/api/doxygen/functions_n.html                  |    4 +-
 docs/api/doxygen/functions_o.html                  |    2 +-
 docs/api/doxygen/functions_p.html                  |   12 +-
 docs/api/doxygen/functions_r.html                  |   14 +-
 docs/api/doxygen/functions_s.html                  |   33 +-
 docs/api/doxygen/functions_t.html                  |   13 +-
 docs/api/doxygen/functions_vars_h.html             |    3 +
 docs/api/doxygen/functions_vars_l.html             |    6 +
 docs/api/doxygen/functions_vars_r.html             |    6 +
 docs/api/doxygen/functions_vars_s.html             |    8 +-
 docs/api/doxygen/hierarchy.html                    |  126 +-
 docs/api/doxygen/image_8h_source.html              |    2 +-
 docs/api/doxygen/inherit_graph_86.svg              | 5239 ++++++++++----------
 docs/api/doxygen/inherits.html                     |    2 +-
 docs/api/doxygen/int__set_8h_source.html           |    4 +-
 docs/api/doxygen/int__solver_8h_source.html        |    4 +-
 docs/api/doxygen/ir_2adt_8h_source.html            |   10 +-
 docs/api/doxygen/ir_2attrs_8h.html                 |    3 +
 docs/api/doxygen/ir_2attrs_8h_source.html          |   93 +-
 docs/api/doxygen/ir_2expr_8h_source.html           |  121 +-
 docs/api/doxygen/ir_2function_8h_source.html       |    4 +-
 docs/api/doxygen/ir_2module_8h_source.html         |    4 +-
 docs/api/doxygen/ir_2op_8h_source.html             |    4 +-
 docs/api/doxygen/iter__affine__map_8h_source.html  |    8 +-
 .../doxygen/local__response__norm_8h_source.html   |    2 +-
 docs/api/doxygen/measure_8h.html                   |    3 +-
 docs/api/doxygen/measure_8h__incl.svg              |  915 ++--
 docs/api/doxygen/measure_8h_source.html            |  123 +-
 docs/api/doxygen/measure__record_8h.html           |    2 +-
 docs/api/doxygen/measure__record_8h__incl.svg      |  907 ++--
 docs/api/doxygen/measure__record_8h_source.html    |    8 +-
 docs/api/doxygen/memory__manager_8h_source.html    |    2 +-
 docs/api/doxygen/namespacemembers_d.html           |    5 +-
 docs/api/doxygen/namespacemembers_e.html           |    5 +-
 docs/api/doxygen/namespacemembers_func_d.html      |    7 +-
 docs/api/doxygen/namespacemembers_func_e.html      |    3 +
 docs/api/doxygen/namespacemembers_func_s.html      |    7 +-
 docs/api/doxygen/namespacemembers_func_t.html      |    2 +-
 docs/api/doxygen/namespacemembers_s.html           |    5 +-
 docs/api/doxygen/namespacemembers_t.html           |    2 +-
 docs/api/doxygen/namespacetvm_1_1detail.html       |   39 +
 docs/api/doxygen/namespacetvm_1_1relay.html        |   47 +
 .../namespacetvm_1_1relay_1_1transform.html        |    4 +-
 docs/api/doxygen/namespacetvm_1_1tir.html          |   38 +-
 docs/api/doxygen/namespacetvm_1_1topi.html         |  407 +-
 docs/api/doxygen/ndarray_8h_source.html            |    2 +-
 docs/api/doxygen/nn_2bnn_8h_source.html            |    4 +-
 docs/api/doxygen/nn_2dense_8h_source.html          |    4 +-
 docs/api/doxygen/nn_2pooling_8h_source.html        |    4 +-
 docs/api/doxygen/nn_2softmax_8h_source.html        |    8 +-
 docs/api/doxygen/node_2container_8h_source.html    |    2 +-
 docs/api/doxygen/op__strategy_8h_source.html       |    2 +-
 docs/api/doxygen/operation_8h_source.html          |   10 +-
 docs/api/doxygen/ravel__unravel_8h_source.html     |    2 +-
 docs/api/doxygen/reduce_8h_source.html             |    2 +-
 docs/api/doxygen/reduction_8h_source.html          |    8 +-
 docs/api/doxygen/relay_2adt_8h_source.html         |    4 +-
 .../doxygen/relay_2attrs_2memory_8h_source.html    |    2 +-
 docs/api/doxygen/relay_2attrs_2nn_8h_source.html   |    4 +-
 docs/api/doxygen/relay_2attrs_2transform_8h.html   |    3 +
 .../doxygen/relay_2attrs_2transform_8h_source.html |  202 +-
 docs/api/doxygen/relay_2attrs_2vm_8h_source.html   |    2 +-
 docs/api/doxygen/relay_2expr_8h_source.html        |   20 +-
 docs/api/doxygen/relay_2expr__functor_8h.html      |    7 +-
 docs/api/doxygen/relay_2expr__functor_8h__incl.svg | 1512 +++---
 .../doxygen/relay_2expr__functor_8h_source.html    |   98 +-
 docs/api/doxygen/relay_2feature_8h_source.html     |    4 +-
 docs/api/doxygen/relay_2function_8h_source.html    |    8 +-
 .../doxygen/relay_2op__attr__types_8h_source.html  |    2 +-
 docs/api/doxygen/relay_2qnn_2attrs_8h_source.html  |    2 +-
 docs/api/doxygen/relay_2transform_8h.html          |    2 +-
 docs/api/doxygen/relay_2transform_8h_source.html   |    6 +-
 docs/api/doxygen/relay_2type_8h_source.html        |    4 +-
 docs/api/doxygen/schedule_8h_source.html           |    8 +-
 docs/api/doxygen/search/all_1.js                   |   21 +-
 docs/api/doxygen/search/all_10.js                  |   14 +-
 docs/api/doxygen/search/all_12.js                  |   11 +-
 docs/api/doxygen/search/all_13.js                  |   36 +-
 docs/api/doxygen/search/all_14.js                  |    8 +-
 docs/api/doxygen/search/all_16.js                  |    6 +-
 docs/api/doxygen/search/all_2.js                   |   14 +-
 docs/api/doxygen/search/all_3.js                   |    8 +-
 docs/api/doxygen/search/all_4.js                   |    7 +-
 docs/api/doxygen/search/all_5.js                   |   11 +-
 docs/api/doxygen/search/all_6.js                   |    8 +-
 docs/api/doxygen/search/all_7.js                   |    4 +-
 docs/api/doxygen/search/all_8.js                   |    5 +-
 docs/api/doxygen/search/all_9.js                   |    8 +-
 docs/api/doxygen/search/all_c.js                   |   12 +-
 docs/api/doxygen/search/all_d.js                   |    8 +-
 docs/api/doxygen/search/all_e.js                   |    4 +-
 docs/api/doxygen/search/all_f.js                   |    4 +-
 docs/api/doxygen/search/classes_0.js               |    1 +
 docs/api/doxygen/search/classes_10.js              |    2 +-
 docs/api/doxygen/search/classes_13.js              |    4 +-
 docs/api/doxygen/search/classes_f.js               |    1 +
 docs/api/doxygen/search/functions_1.js             |   12 +-
 docs/api/doxygen/search/functions_10.js            |   12 +-
 docs/api/doxygen/search/functions_12.js            |    4 +-
 docs/api/doxygen/search/functions_13.js            |   20 +-
 docs/api/doxygen/search/functions_14.js            |    4 +-
 docs/api/doxygen/search/functions_16.js            |    2 +-
 docs/api/doxygen/search/functions_2.js             |   14 +-
 docs/api/doxygen/search/functions_3.js             |    8 +-
 docs/api/doxygen/search/functions_4.js             |    5 +-
 docs/api/doxygen/search/functions_5.js             |    7 +-
 docs/api/doxygen/search/functions_6.js             |    8 +-
 docs/api/doxygen/search/functions_7.js             |    4 +-
 docs/api/doxygen/search/functions_9.js             |    8 +-
 docs/api/doxygen/search/functions_c.js             |   10 +-
 docs/api/doxygen/search/functions_d.js             |    8 +-
 docs/api/doxygen/search/functions_e.js             |    4 +-
 docs/api/doxygen/search/functions_f.js             |    4 +-
 docs/api/doxygen/search/variables_10.js            |    2 +
 docs/api/doxygen/search/variables_11.js            |    2 +-
 docs/api/doxygen/search/variables_8.js             |    1 +
 docs/api/doxygen/search/variables_b.js             |    2 +
 docs/api/doxygen/search__policy_8h.html            |    2 +-
 docs/api/doxygen/search__policy_8h__incl.svg       |  955 ++--
 docs/api/doxygen/search__policy_8h_source.html     |    2 +-
 docs/api/doxygen/stmt_8h.html                      |    6 +-
 docs/api/doxygen/stmt_8h_source.html               |  359 +-
 docs/api/doxygen/stmt__functor_8h_source.html      |   34 +-
 ...cttvm_1_1relay_1_1ReshapeLikeAttrs-members.html |  125 +
 .../structtvm_1_1relay_1_1ReshapeLikeAttrs.html    |  273 +
 ...m_1_1relay_1_1ReshapeLikeAttrs__coll__graph.svg |  189 +
 ..._1relay_1_1ReshapeLikeAttrs__inherit__graph.svg |   90 +
 .../structtvm_1_1tir_1_1LENode-members.html        |    9 +-
 docs/api/doxygen/structtvm_1_1tir_1_1LENode.html   |    8 +-
 .../structtvm_1_1tir_1_1LENode__coll__graph.svg    |  268 +-
 .../structtvm_1_1tir_1_1LENode__inherit__graph.svg |   35 +-
 docs/api/doxygen/target__info_8h_source.html       |    2 +-
 docs/api/doxygen/tensor_8h_source.html             |    8 +-
 docs/api/doxygen/tensor__intrin_8h_source.html     |    2 +-
 docs/api/doxygen/tensor__type_8h_source.html       |    4 +-
 docs/api/doxygen/tir_2analysis_8h_source.html      |    6 +-
 docs/api/doxygen/tir_2expr_8h_source.html          |  156 +-
 docs/api/doxygen/tir_2expr__functor_8h_source.html |   22 +-
 docs/api/doxygen/tir_2function_8h_source.html      |   16 +-
 docs/api/doxygen/tir_2op_8h_source.html            |   26 +-
 .../doxygen/tir_2op__attr__types_8h_source.html    |    4 +-
 docs/api/doxygen/tir_2transform_8h_source.html     |    2 +-
 docs/api/doxygen/topi_2nn_8h_source.html           |    6 +-
 docs/api/doxygen/topi_2transform_8h.html           |    3 +
 docs/api/doxygen/topi_2transform_8h_source.html    |   55 +-
 docs/api/doxygen/transform__step_8h_source.html    |    4 +-
 docs/api/doxygen/type__relation_8h_source.html     |    2 +-
 docs/api/doxygen/utils_8h_source.html              |    2 +-
 docs/api/doxygen/var_8h_source.html                |   70 +-
 docs/api/doxygen/vision_8h_source.html             |    4 +-
 docs/api/doxygen/x86_2bnn_8h_source.html           |    2 +-
 docs/api/python/auto_scheduler.html                |   27 +-
 docs/api/python/autotvm.html                       |    4 +-
 docs/api/python/relay/index.html                   |   47 +-
 docs/api/python/relay/transform.html               |    2 +
 docs/api/python/tir.html                           |  271 +-
 docs/api/python/topi.html                          |  273 +-
 .../rust/implementors/core/clone/trait.Clone.js    |    2 +-
 docs/api/rust/implementors/core/cmp/trait.Eq.js    |    4 +-
 .../rust/implementors/core/cmp/trait.PartialEq.js  |    4 +-
 .../rust/implementors/core/convert/trait.AsRef.js  |    2 +-
 .../rust/implementors/core/convert/trait.From.js   |    2 +-
 .../implementors/core/convert/trait.TryFrom.js     |    2 +-
 docs/api/rust/implementors/core/fmt/trait.Debug.js |    4 +-
 docs/api/rust/implementors/core/hash/trait.Hash.js |    4 +-
 .../core/iter/traits/collect/trait.FromIterator.js |    2 +-
 .../rust/implementors/core/marker/trait.Freeze.js  |    2 +-
 .../rust/implementors/core/marker/trait.Send.js    |    2 +-
 .../implementors/core/marker/trait.StructuralEq.js |    1 +
 .../core/marker/trait.StructuralPartialEq.js       |    1 +
 .../rust/implementors/core/marker/trait.Sync.js    |    2 +-
 .../rust/implementors/core/marker/trait.Unpin.js   |    2 +-
 .../implementors/core/ops/deref/trait.Deref.js     |    2 +-
 .../implementors/std/panic/trait.RefUnwindSafe.js  |    2 +-
 .../implementors/std/panic/trait.UnwindSafe.js     |    2 +-
 .../implementors/tvm/runtime/trait.IsObject.js     |    2 +-
 .../implementors/tvm/runtime/trait.IsObjectRef.js  |    2 +-
 docs/api/rust/search-index.js                      |   13 +-
 docs/api/rust/source-files.js                      |    1 -
 docs/api/rust/src/test_rt_wasm32/main.rs.html      |  112 -
 docs/api/rust/src/tvm/ir/arith.rs.html             |    2 +-
 docs/api/rust/src/tvm/ir/attrs.rs.html             |    2 +-
 docs/api/rust/src/tvm/ir/diagnostics/mod.rs.html   |    8 +-
 docs/api/rust/src/tvm/ir/expr.rs.html              |   16 +-
 docs/api/rust/src/tvm/ir/function.rs.html          |    2 +-
 docs/api/rust/src/tvm/ir/module.rs.html            |  536 +-
 docs/api/rust/src/tvm/ir/op.rs.html                |    2 +-
 docs/api/rust/src/tvm/ir/relay/attrs/nn.rs.html    |   14 +-
 .../rust/src/tvm/ir/relay/attrs/transform.rs.html  |    2 +-
 docs/api/rust/src/tvm/ir/relay/mod.rs.html         |  100 +-
 docs/api/rust/src/tvm/ir/source_map.rs.html        |    4 +-
 docs/api/rust/src/tvm/ir/span.rs.html              |    4 +-
 docs/api/rust/src/tvm/ir/tir.rs.html               |   30 +-
 docs/api/rust/src/tvm/ir/ty.rs.html                |  180 +-
 docs/api/rust/src/tvm/transform.rs.html            |    2 +-
 docs/api/rust/src/tvm_macros/external.rs.html      |   84 +-
 docs/api/rust/src/tvm_macros/lib.rs.html           |    8 +-
 docs/api/rust/src/tvm_macros/object.rs.html        |   62 +-
 docs/api/rust/src/tvm_rt/array.rs.html             |   28 +-
 docs/api/rust/src/tvm_rt/map.rs.html               |    4 -
 docs/api/rust/src/tvm_rt/ndarray.rs.html           |    2 +-
 docs/api/rust/src/tvm_rt/object/mod.rs.html        |  210 +-
 docs/api/rust/src/tvm_rt/object/object_ptr.rs.html |   78 +-
 docs/api/rust/src/tvm_rt/string.rs.html            |    4 +-
 docs/api/rust/src/tvm_rt/value.rs.html             |    2 -
 docs/api/rust/src/tvm_sys/datatype.rs.html         |    8 +
 docs/api/rust/test_rt_wasm32/all.html              |    4 -
 .../test_rt_wasm32/fn.__get_tvm_module_ctx.html    |    2 -
 docs/api/rust/test_rt_wasm32/fn.main.html          |    2 -
 docs/api/rust/test_rt_wasm32/index.html            |    4 -
 docs/api/rust/test_rt_wasm32/sidebar-items.js      |    1 -
 .../test_rt_wasm32/static.__tvm_module_ctx.html    |    2 -
 docs/api/rust/tvm/all.html                         |    2 +-
 docs/api/rust/tvm/context/enum.DeviceType.html     |    4 +-
 docs/api/rust/tvm/context/fn.get_device_attr.html  |    2 -
 docs/api/rust/tvm/context/index.html               |    5 +-
 docs/api/rust/tvm/context/sidebar-items.js         |    2 +-
 docs/api/rust/tvm/context/struct.Context.html      |   12 +-
 docs/api/rust/tvm/enum.DeviceType.html             |    4 +-
 docs/api/rust/tvm/enum.Error.html                  |    6 +-
 docs/api/rust/tvm/enum.NDArrayError.html           |    6 +-
 docs/api/rust/tvm/errors/enum.Error.html           |    6 +-
 docs/api/rust/tvm/errors/enum.NDArrayError.html    |    6 +-
 docs/api/rust/tvm/function/enum.ArgValue.html      |  490 +-
 docs/api/rust/tvm/function/enum.RetValue.html      |  367 +-
 .../rust/tvm/function/ffi/struct.DLContext.html    |   26 +-
 .../rust/tvm/function/ffi/struct.DLDataType.html   |   24 +-
 .../api/rust/tvm/function/ffi/struct.DLTensor.html |   12 +-
 .../rust/tvm/function/ffi/struct.TVMByteArray.html |    4 +-
 docs/api/rust/tvm/function/ffi/union.TVMValue.html |   32 +-
 docs/api/rust/tvm/function/struct.Function.html    |    8 +-
 .../rust/tvm/ir/arith/struct.ConstIntBound.html    |   11 +-
 .../tvm/ir/arith/struct.ConstIntBoundNode.html     |    5 +-
 docs/api/rust/tvm/ir/attrs/struct.Attrs.html       |   11 +-
 .../rust/tvm/ir/attrs/struct.BaseAttrsNode.html    |    5 +-
 .../tvm/ir/diagnostics/enum.DiagnosticLevel.html   |    9 +-
 .../rust/tvm/ir/diagnostics/fn.clear_renderer.html |    2 -
 .../diagnostics/fn.diagnositc_renderer_render.html |    2 -
 .../diagnostics/fn.diagnostic_context_default.html |    2 -
 .../diagnostics/fn.diagnostic_context_render.html  |    2 -
 .../tvm/ir/diagnostics/fn.diagnostic_renderer.html |    2 -
 docs/api/rust/tvm/ir/diagnostics/fn.emit.html      |    2 -
 .../rust/tvm/ir/diagnostics/fn.get_renderer.html   |    2 -
 docs/api/rust/tvm/ir/diagnostics/index.html        |    7 +-
 docs/api/rust/tvm/ir/diagnostics/sidebar-items.js  |    2 +-
 .../rust/tvm/ir/diagnostics/struct.Diagnostic.html |   31 +-
 .../ir/diagnostics/struct.DiagnosticBuilder.html   |    4 +-
 .../ir/diagnostics/struct.DiagnosticContext.html   |   39 +-
 .../diagnostics/struct.DiagnosticContextNode.html  |   13 +-
 .../tvm/ir/diagnostics/struct.DiagnosticNode.html  |   13 +-
 .../ir/diagnostics/struct.DiagnosticRenderer.html  |   33 +-
 .../diagnostics/struct.DiagnosticRendererNode.html |   13 +-
 docs/api/rust/tvm/ir/expr/fn._as_text.html         |    2 -
 docs/api/rust/tvm/ir/expr/fn.as_text.html          |    2 +-
 docs/api/rust/tvm/ir/expr/index.html               |    4 +-
 docs/api/rust/tvm/ir/expr/sidebar-items.js         |    2 +-
 docs/api/rust/tvm/ir/expr/struct.BaseExpr.html     |   31 +-
 docs/api/rust/tvm/ir/expr/struct.BaseExprNode.html |   13 +-
 docs/api/rust/tvm/ir/expr/struct.GlobalVar.html    |   31 +-
 .../api/rust/tvm/ir/expr/struct.GlobalVarNode.html |   13 +-
 docs/api/rust/tvm/ir/expr/struct.PrimExpr.html     |   34 +-
 docs/api/rust/tvm/ir/expr/struct.PrimExprNode.html |   13 +-
 docs/api/rust/tvm/ir/function/struct.BaseFunc.html |   11 +-
 .../rust/tvm/ir/function/struct.BaseFuncNode.html  |    5 +-
 docs/api/rust/tvm/ir/module/fn.module_add_def.html |    2 -
 .../tvm/ir/module/fn.module_get_global_var.html    |    2 -
 .../tvm/ir/module/fn.module_get_global_vars.html   |    2 -
 docs/api/rust/tvm/ir/module/fn.module_lookup.html  |    2 -
 .../rust/tvm/ir/module/fn.module_lookup_str.html   |    2 -
 .../rust/tvm/ir/module/fn.parse_expression.html    |    2 -
 docs/api/rust/tvm/ir/module/fn.parse_module.html   |    2 -
 docs/api/rust/tvm/ir/module/index.html             |    7 +-
 docs/api/rust/tvm/ir/module/sidebar-items.js       |    2 +-
 docs/api/rust/tvm/ir/module/struct.IRModule.html   |   13 +-
 .../rust/tvm/ir/module/struct.IRModuleNode.html    |    9 +-
 docs/api/rust/tvm/ir/op/struct.Op.html             |   11 +-
 docs/api/rust/tvm/ir/op/struct.OpNode.html         |    5 +-
 .../ir/relay/attrs/nn/struct.BatchNormAttrs.html   |   11 +-
 .../relay/attrs/nn/struct.BatchNormAttrsNode.html  |    5 +-
 .../tvm/ir/relay/attrs/nn/struct.BiasAddAttrs.html |   11 +-
 .../ir/relay/attrs/nn/struct.BiasAddAttrsNode.html |    5 +-
 .../tvm/ir/relay/attrs/nn/struct.Conv2DAttrs.html  |   11 +-
 .../ir/relay/attrs/nn/struct.Conv2DAttrsNode.html  |    5 +-
 .../tvm/ir/relay/attrs/nn/struct.DenseAttrs.html   |   11 +-
 .../ir/relay/attrs/nn/struct.DenseAttrsNode.html   |    5 +-
 .../relay/attrs/nn/struct.GlobalPool2DAttrs.html   |   11 +-
 .../attrs/nn/struct.GlobalPool2DAttrsNode.html     |    5 +-
 .../ir/relay/attrs/nn/struct.MaxPool2DAttrs.html   |   11 +-
 .../relay/attrs/nn/struct.MaxPool2DAttrsNode.html  |    5 +-
 .../tvm/ir/relay/attrs/nn/struct.SoftmaxAttrs.html |   11 +-
 .../ir/relay/attrs/nn/struct.SoftmaxAttrsNode.html |    5 +-
 .../attrs/transform/struct.ExpandDimsAttrs.html    |   11 +-
 .../transform/struct.ExpandDimsAttrsNode.html      |    5 +-
 docs/api/rust/tvm/ir/relay/index.html              |    4 +-
 docs/api/rust/tvm/ir/relay/sidebar-items.js        |    2 +-
 docs/api/rust/tvm/ir/relay/struct.Call.html        |   31 +-
 docs/api/rust/tvm/ir/relay/struct.CallNode.html    |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Clause.html      |   31 +-
 docs/api/rust/tvm/ir/relay/struct.ClauseNode.html  |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Constant.html    |   31 +-
 .../api/rust/tvm/ir/relay/struct.ConstantNode.html |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Constructor.html |   31 +-
 .../rust/tvm/ir/relay/struct.ConstructorNode.html  |   13 +-
 docs/api/rust/tvm/ir/relay/struct.DataType.html    |   47 +
 docs/api/rust/tvm/ir/relay/struct.Expr.html        |   31 +-
 docs/api/rust/tvm/ir/relay/struct.ExprNode.html    |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Function.html    |   31 +-
 .../api/rust/tvm/ir/relay/struct.FunctionNode.html |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Id.html          |   31 +-
 docs/api/rust/tvm/ir/relay/struct.IdNode.html      |   13 +-
 docs/api/rust/tvm/ir/relay/struct.If.html          |   31 +-
 docs/api/rust/tvm/ir/relay/struct.IfNode.html      |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Let.html         |   31 +-
 docs/api/rust/tvm/ir/relay/struct.LetNode.html     |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Match.html       |   31 +-
 docs/api/rust/tvm/ir/relay/struct.MatchNode.html   |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Pattern.html     |   31 +-
 .../tvm/ir/relay/struct.PatternConstructor.html    |   31 +-
 .../ir/relay/struct.PatternConstructorNode.html    |   13 +-
 docs/api/rust/tvm/ir/relay/struct.PatternNode.html |   13 +-
 .../api/rust/tvm/ir/relay/struct.PatternTuple.html |   31 +-
 .../rust/tvm/ir/relay/struct.PatternTupleNode.html |   13 +-
 docs/api/rust/tvm/ir/relay/struct.PatternVar.html  |   31 +-
 .../rust/tvm/ir/relay/struct.PatternVarNode.html   |   13 +-
 .../rust/tvm/ir/relay/struct.PatternWildcard.html  |   31 +-
 .../tvm/ir/relay/struct.PatternWildcardNode.html   |   13 +-
 docs/api/rust/tvm/ir/relay/struct.RefCreate.html   |   31 +-
 .../rust/tvm/ir/relay/struct.RefCreateNode.html    |   13 +-
 docs/api/rust/tvm/ir/relay/struct.RefRead.html     |   31 +-
 docs/api/rust/tvm/ir/relay/struct.RefReadNode.html |   13 +-
 docs/api/rust/tvm/ir/relay/struct.RefWrite.html    |   31 +-
 .../api/rust/tvm/ir/relay/struct.RefWriteNode.html |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Tuple.html       |   31 +-
 .../api/rust/tvm/ir/relay/struct.TupleGetItem.html |   31 +-
 .../rust/tvm/ir/relay/struct.TupleGetItemNode.html |   13 +-
 docs/api/rust/tvm/ir/relay/struct.TupleNode.html   |   13 +-
 docs/api/rust/tvm/ir/relay/struct.Var.html         |   31 +-
 docs/api/rust/tvm/ir/relay/struct.VarNode.html     |   13 +-
 docs/api/rust/tvm/ir/source_map/struct.Source.html |   11 +-
 .../rust/tvm/ir/source_map/struct.SourceMap.html   |   11 +-
 .../tvm/ir/source_map/struct.SourceMapNode.html    |    5 +-
 .../rust/tvm/ir/source_map/struct.SourceNode.html  |    5 +-
 docs/api/rust/tvm/ir/span/struct.SourceName.html   |   11 +-
 .../rust/tvm/ir/span/struct.SourceNameNode.html    |    5 +-
 docs/api/rust/tvm/ir/span/struct.Span.html         |   11 +-
 docs/api/rust/tvm/ir/span/struct.SpanNode.html     |    5 +-
 docs/api/rust/tvm/ir/tir/index.html                |    2 +-
 docs/api/rust/tvm/ir/tir/struct.Add.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.AddNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.And.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.AndNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Cast.html          |   11 +-
 docs/api/rust/tvm/ir/tir/struct.CastNode.html      |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Div.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.DivNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Eq.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.EqNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.FloorDiv.html      |   11 +-
 docs/api/rust/tvm/ir/tir/struct.FloorDivNode.html  |    5 +-
 docs/api/rust/tvm/ir/tir/struct.FloorMod.html      |   11 +-
 docs/api/rust/tvm/ir/tir/struct.FloorModNode.html  |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Ge.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.GeNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Gt.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.GtNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.IntImm.html        |   14 +-
 docs/api/rust/tvm/ir/tir/struct.IntImmNode.html    |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Le.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.LeNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Let.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.LetNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Lt.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.LtNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Max.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.MaxNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Min.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.MinNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Mod.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.ModNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Mul.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.MulNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Ne.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.NeNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Not.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.NotNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Or.html            |   11 +-
 docs/api/rust/tvm/ir/tir/struct.OrNode.html        |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Ramp.html          |   11 +-
 docs/api/rust/tvm/ir/tir/struct.RampNode.html      |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Select.html        |   11 +-
 docs/api/rust/tvm/ir/tir/struct.SelectNode.html    |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Sub.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.SubNode.html       |    5 +-
 docs/api/rust/tvm/ir/tir/struct.Var.html           |   11 +-
 docs/api/rust/tvm/ir/tir/struct.VarNode.html       |    5 +-
 docs/api/rust/tvm/ir/ty/enum.TypeKind.html         |    9 +-
 docs/api/rust/tvm/ir/ty/index.html                 |    5 +-
 docs/api/rust/tvm/ir/ty/sidebar-items.js           |    2 +-
 docs/api/rust/tvm/ir/ty/struct.BaseTensorType.html |   31 +-
 .../rust/tvm/ir/ty/struct.BaseTensorTypeNode.html  |   13 +-
 docs/api/rust/tvm/ir/ty/struct.FuncType.html       |   31 +-
 docs/api/rust/tvm/ir/ty/struct.FuncTypeNode.html   |   13 +-
 docs/api/rust/tvm/ir/ty/struct.GlobalTypeVar.html  |   31 +-
 .../rust/tvm/ir/ty/struct.GlobalTypeVarNode.html   |   15 +-
 docs/api/rust/tvm/ir/ty/struct.IncompleteType.html |   31 +-
 .../rust/tvm/ir/ty/struct.IncompleteTypeNode.html  |   13 +-
 docs/api/rust/tvm/ir/ty/struct.PointerType.html    |   31 +-
 .../api/rust/tvm/ir/ty/struct.PointerTypeNode.html |   13 +-
 docs/api/rust/tvm/ir/ty/struct.PrimType.html       |   31 +-
 docs/api/rust/tvm/ir/ty/struct.PrimTypeNode.html   |   13 +-
 docs/api/rust/tvm/ir/ty/struct.RefType.html        |   31 +-
 .../rust/tvm/ir/ty/struct.RelayRefTypeNode.html    |   13 +-
 docs/api/rust/tvm/ir/ty/struct.TensorType.html     |   31 +-
 docs/api/rust/tvm/ir/ty/struct.TensorTypeNode.html |   13 +-
 docs/api/rust/tvm/ir/ty/struct.TupleType.html      |   31 +-
 docs/api/rust/tvm/ir/ty/struct.TupleTypeNode.html  |   13 +-
 docs/api/rust/tvm/ir/ty/struct.Type.html           |   31 +-
 docs/api/rust/tvm/ir/ty/struct.TypeConstraint.html |   31 +-
 .../rust/tvm/ir/ty/struct.TypeConstraintNode.html  |   13 +-
 .../{struct.FuncType.html => struct.TypeData.html} |   33 +-
 ....FuncTypeNode.html => struct.TypeDataNode.html} |   34 +-
 docs/api/rust/tvm/ir/ty/struct.TypeNode.html       |   13 +-
 docs/api/rust/tvm/ir/ty/struct.TypeVar.html        |   31 +-
 docs/api/rust/tvm/ir/ty/struct.TypeVarNode.html    |   23 +-
 docs/api/rust/tvm/module/fn.load_from_file.html    |    2 -
 docs/api/rust/tvm/module/fn.runtime_enabled.html   |    2 -
 docs/api/rust/tvm/module/index.html                |    5 +-
 docs/api/rust/tvm/module/sidebar-items.js          |    2 +-
 docs/api/rust/tvm/module/struct.Module.html        |   18 +-
 docs/api/rust/tvm/ndarray/struct.NDArray.html      |   17 +-
 .../rust/tvm/ndarray/struct.NDArrayContainer.html  |    9 +-
 docs/api/rust/tvm/ndarray/trait.Num32.html         |    2 +-
 .../rust/tvm/runtime/array/fn.array_get_item.html  |    2 -
 docs/api/rust/tvm/runtime/array/fn.array_size.html |    2 -
 docs/api/rust/tvm/runtime/array/index.html         |    5 +-
 docs/api/rust/tvm/runtime/array/sidebar-items.js   |    2 +-
 docs/api/rust/tvm/runtime/array/struct.Array.html  |   20 +-
 .../rust/tvm/runtime/array/struct.IntoIter.html    |    4 +-
 .../rust/tvm/runtime/context/enum.DeviceType.html  |    4 +-
 .../tvm/runtime/context/fn.get_device_attr.html    |    2 -
 docs/api/rust/tvm/runtime/context/index.html       |    5 +-
 docs/api/rust/tvm/runtime/context/sidebar-items.js |    2 +-
 .../rust/tvm/runtime/context/struct.Context.html   |   12 +-
 docs/api/rust/tvm/runtime/enum.ArgValue.html       |  490 +-
 docs/api/rust/tvm/runtime/enum.DeviceType.html     |    4 +-
 docs/api/rust/tvm/runtime/enum.Error.html          |    6 +-
 docs/api/rust/tvm/runtime/enum.NDArrayError.html   |    6 +-
 docs/api/rust/tvm/runtime/enum.RetValue.html       |  367 +-
 docs/api/rust/tvm/runtime/errors/enum.Error.html   |    6 +-
 .../rust/tvm/runtime/errors/enum.NDArrayError.html |    6 +-
 docs/api/rust/tvm/runtime/fn.debug_print.html      |    2 +-
 docs/api/rust/tvm/runtime/fn.structural_equal.html |    2 -
 docs/api/rust/tvm/runtime/fn.structural_hash.html  |    2 -
 .../rust/tvm/runtime/function/enum.ArgValue.html   |  490 +-
 .../rust/tvm/runtime/function/enum.RetValue.html   |  367 +-
 .../tvm/runtime/function/ffi/struct.DLContext.html |   26 +-
 .../runtime/function/ffi/struct.DLDataType.html    |   24 +-
 .../tvm/runtime/function/ffi/struct.DLTensor.html  |   12 +-
 .../runtime/function/ffi/struct.TVMByteArray.html  |    4 +-
 .../tvm/runtime/function/ffi/union.TVMValue.html   |   32 +-
 .../rust/tvm/runtime/function/struct.Function.html |    8 +-
 docs/api/rust/tvm/runtime/index.html               |    2 +-
 docs/api/rust/tvm/runtime/macro.external.html      |    2 +-
 .../rust/tvm/runtime/map/fn.array_get_item.html    |    2 -
 docs/api/rust/tvm/runtime/map/fn.map_count.html    |    2 -
 docs/api/rust/tvm/runtime/map/fn.map_get_item.html |    2 -
 docs/api/rust/tvm/runtime/map/fn.map_items.html    |    2 -
 docs/api/rust/tvm/runtime/map/fn.map_size.html     |    2 -
 docs/api/rust/tvm/runtime/map/index.html           |    5 +-
 docs/api/rust/tvm/runtime/map/sidebar-items.js     |    2 +-
 docs/api/rust/tvm/runtime/map/struct.IntoIter.html |    6 +-
 docs/api/rust/tvm/runtime/map/struct.Map.html      |   22 +-
 .../rust/tvm/runtime/module/fn.load_from_file.html |    2 -
 .../tvm/runtime/module/fn.runtime_enabled.html     |    2 -
 docs/api/rust/tvm/runtime/module/index.html        |    5 +-
 docs/api/rust/tvm/runtime/module/sidebar-items.js  |    2 +-
 .../api/rust/tvm/runtime/module/struct.Module.html |   18 +-
 .../rust/tvm/runtime/ndarray/struct.NDArray.html   |   17 +-
 .../runtime/ndarray/struct.NDArrayContainer.html   |    9 +-
 docs/api/rust/tvm/runtime/ndarray/trait.Num32.html |    2 +-
 .../rust/tvm/runtime/object/fn.debug_print.html    |    2 +-
 .../tvm/runtime/object/fn.structural_equal.html    |    2 -
 .../tvm/runtime/object/fn.structural_hash.html     |    2 -
 docs/api/rust/tvm/runtime/object/index.html        |    2 +-
 docs/api/rust/tvm/runtime/object/sidebar-items.js  |    2 +-
 .../api/rust/tvm/runtime/object/struct.Object.html |   12 +-
 .../rust/tvm/runtime/object/struct.ObjectPtr.html  |  118 +-
 .../rust/tvm/runtime/object/struct.ObjectRef.html  |   31 +-
 .../rust/tvm/runtime/object/trait.IsObject.html    |    4 +-
 .../rust/tvm/runtime/object/trait.IsObjectRef.html |    4 +-
 docs/api/rust/tvm/runtime/sidebar-items.js         |    2 +-
 .../api/rust/tvm/runtime/string/struct.String.html |   36 +-
 .../rust/tvm/runtime/string/struct.StringObj.html  |    9 +-
 docs/api/rust/tvm/runtime/struct.ByteArray.html    |   12 +-
 docs/api/rust/tvm/runtime/struct.Context.html      |   12 +-
 docs/api/rust/tvm/runtime/struct.DataType.html     |   24 +-
 docs/api/rust/tvm/runtime/struct.Function.html     |    8 +-
 docs/api/rust/tvm/runtime/struct.Module.html       |   18 +-
 docs/api/rust/tvm/runtime/struct.NDArray.html      |   17 +-
 docs/api/rust/tvm/runtime/struct.Object.html       |   12 +-
 docs/api/rust/tvm/runtime/struct.ObjectPtr.html    |  118 +-
 docs/api/rust/tvm/runtime/struct.ObjectRef.html    |   31 +-
 docs/api/rust/tvm/runtime/struct.String.html       |   36 +-
 docs/api/rust/tvm/runtime/struct.StringObj.html    |    9 +-
 docs/api/rust/tvm/runtime/trait.IsObject.html      |    4 +-
 docs/api/rust/tvm/runtime/trait.IsObjectRef.html   |    4 +-
 docs/api/rust/tvm/struct.Context.html              |   12 +-
 docs/api/rust/tvm/struct.DataType.html             |   24 +-
 docs/api/rust/tvm/struct.Function.html             |    8 +-
 docs/api/rust/tvm/struct.Module.html               |   18 +-
 docs/api/rust/tvm/struct.NDArray.html              |   17 +-
 .../rust/tvm/transform/fn.create_func_pass.html    |    2 -
 docs/api/rust/tvm/transform/index.html             |    2 +-
 docs/api/rust/tvm/transform/sidebar-items.js       |    2 +-
 docs/api/rust/tvm/transform/struct.PassInfo.html   |   11 +-
 .../rust/tvm/transform/struct.PassInfoNode.html    |    5 +-
 docs/api/rust/tvm_graph_rt/enum.ArgValue.html      |   92 +-
 docs/api/rust/tvm_graph_rt/enum.RetValue.html      |   52 +-
 .../rust/tvm_graph_rt/ffi/struct.DLContext.html    |   26 +-
 .../rust/tvm_graph_rt/ffi/struct.DLDataType.html   |   22 +-
 .../api/rust/tvm_graph_rt/ffi/struct.DLTensor.html |   12 +-
 .../rust/tvm_graph_rt/ffi/struct.TVMByteArray.html |    4 +-
 docs/api/rust/tvm_graph_rt/ffi/union.TVMValue.html |   32 +-
 .../api/rust/tvm_graph_rt/macro.import_module.html |    2 +-
 .../tvm_graph_rt/packed_func/enum.ArgValue.html    |   92 +-
 .../tvm_graph_rt/packed_func/enum.RetValue.html    |   52 +-
 .../tvm_graph_rt/packed_func/union.TVMValue.html   |   32 +-
 docs/api/rust/tvm_graph_rt/struct.DLTensor.html    |   12 +-
 docs/api/rust/tvm_graph_rt/struct.Entry.html       |    6 +-
 docs/api/rust/tvm_graph_rt/struct.Graph.html       |    6 +-
 docs/api/rust/tvm_graph_rt/struct.Node.html        |    6 +-
 docs/api/rust/tvm_graph_rt/union.TVMValue.html     |   32 +-
 docs/api/rust/tvm_macros/derive.Object.html        |    3 +-
 docs/api/rust/tvm_macros/index.html                |    2 +-
 docs/api/rust/tvm_macros/macro.external.html       |    2 +-
 docs/api/rust/tvm_macros/macro.import_module.html  |    2 +-
 docs/api/rust/tvm_rt/all.html                      |    2 +-
 docs/api/rust/tvm_rt/array/fn.array_get_item.html  |    2 -
 docs/api/rust/tvm_rt/array/fn.array_size.html      |    2 -
 docs/api/rust/tvm_rt/array/index.html              |    7 +-
 docs/api/rust/tvm_rt/array/sidebar-items.js        |    2 +-
 docs/api/rust/tvm_rt/array/struct.Array.html       |   20 +-
 docs/api/rust/tvm_rt/array/struct.IntoIter.html    |    4 +-
 docs/api/rust/tvm_rt/context/enum.DeviceType.html  |    4 +-
 .../rust/tvm_rt/context/fn.get_device_attr.html    |    2 -
 docs/api/rust/tvm_rt/context/index.html            |    5 +-
 docs/api/rust/tvm_rt/context/sidebar-items.js      |    2 +-
 docs/api/rust/tvm_rt/context/struct.Context.html   |   12 +-
 docs/api/rust/tvm_rt/enum.ArgValue.html            |  130 +-
 docs/api/rust/tvm_rt/enum.DeviceType.html          |    4 +-
 docs/api/rust/tvm_rt/enum.RetValue.html            |   98 +-
 docs/api/rust/tvm_rt/function/enum.ArgValue.html   |  130 +-
 docs/api/rust/tvm_rt/function/enum.RetValue.html   |   98 +-
 .../rust/tvm_rt/function/ffi/struct.DLContext.html |   26 +-
 .../tvm_rt/function/ffi/struct.DLDataType.html     |   24 +-
 .../rust/tvm_rt/function/ffi/struct.DLTensor.html  |   12 +-
 .../tvm_rt/function/ffi/struct.TVMByteArray.html   |    4 +-
 .../rust/tvm_rt/function/ffi/union.TVMValue.html   |   32 +-
 docs/api/rust/tvm_rt/macro.external.html           |    2 +-
 docs/api/rust/tvm_rt/map/fn.array_get_item.html    |    2 -
 docs/api/rust/tvm_rt/map/fn.map_count.html         |    2 -
 docs/api/rust/tvm_rt/map/fn.map_get_item.html      |    2 -
 docs/api/rust/tvm_rt/map/fn.map_items.html         |    2 -
 docs/api/rust/tvm_rt/map/fn.map_size.html          |    2 -
 docs/api/rust/tvm_rt/map/index.html                |    7 +-
 docs/api/rust/tvm_rt/map/sidebar-items.js          |    2 +-
 docs/api/rust/tvm_rt/map/struct.IntoIter.html      |    6 +-
 docs/api/rust/tvm_rt/map/struct.Map.html           |   24 +-
 docs/api/rust/tvm_rt/module/fn.load_from_file.html |    2 -
 .../api/rust/tvm_rt/module/fn.runtime_enabled.html |    2 -
 docs/api/rust/tvm_rt/module/index.html             |    5 +-
 docs/api/rust/tvm_rt/module/sidebar-items.js       |    2 +-
 docs/api/rust/tvm_rt/module/struct.Module.html     |   18 +-
 docs/api/rust/tvm_rt/ndarray/struct.NDArray.html   |   11 +-
 .../tvm_rt/ndarray/struct.NDArrayContainer.html    |    5 +-
 docs/api/rust/tvm_rt/object/fn.debug_print.html    |    2 +-
 .../rust/tvm_rt/object/fn.structural_equal.html    |    2 -
 .../api/rust/tvm_rt/object/fn.structural_hash.html |    2 -
 docs/api/rust/tvm_rt/object/index.html             |    4 +-
 docs/api/rust/tvm_rt/object/sidebar-items.js       |    2 +-
 docs/api/rust/tvm_rt/object/struct.Object.html     |   12 +-
 docs/api/rust/tvm_rt/object/struct.ObjectPtr.html  |   33 +-
 docs/api/rust/tvm_rt/object/struct.ObjectRef.html  |   31 +-
 docs/api/rust/tvm_rt/object/trait.IsObject.html    |    4 +-
 docs/api/rust/tvm_rt/object/trait.IsObjectRef.html |    4 +-
 docs/api/rust/tvm_rt/string/index.html             |    2 +-
 docs/api/rust/tvm_rt/string/struct.String.html     |   20 +-
 docs/api/rust/tvm_rt/string/struct.StringObj.html  |    7 +-
 docs/api/rust/tvm_rt/struct.ByteArray.html         |   12 +-
 docs/api/rust/tvm_rt/struct.Context.html           |   12 +-
 docs/api/rust/tvm_rt/struct.DataType.html          |   24 +-
 docs/api/rust/tvm_rt/value/index.html              |    2 +-
 .../tvm_sys/datatype/enum.ParseDataTypeError.html  |    8 +-
 docs/api/rust/tvm_sys/datatype/index.html          |    2 +-
 .../api/rust/tvm_sys/datatype/struct.DataType.html |   24 +-
 docs/api/rust/tvm_sys/ffi/struct.DLDataType.html   |    6 +-
 .../rust/tvm_sys/packed_func/enum.RetValue.html    |    6 +-
 docs/api/typedoc/assets/js/search.json             |    2 +-
 docs/api/typedoc/classes/bytestreamreader.html     |   12 +-
 docs/api/typedoc/classes/cachedcallstack.html      |   34 +-
 docs/api/typedoc/classes/dlcontext.html            |   10 +-
 docs/api/typedoc/classes/dldatatype.html           |   12 +-
 docs/api/typedoc/classes/environment.html          |   12 +-
 docs/api/typedoc/classes/ffilibrary.html           |   20 +-
 docs/api/typedoc/classes/graphruntime.html         |   16 +-
 docs/api/typedoc/classes/instance.html             |   40 +-
 docs/api/typedoc/classes/memory.html               |   34 +-
 docs/api/typedoc/classes/module.html               |   10 +-
 docs/api/typedoc/classes/ndarray.html              |   22 +-
 docs/api/typedoc/classes/packedfunccell.html       |    6 +-
 docs/api/typedoc/classes/rpcserver.html            |   14 +-
 docs/api/typedoc/classes/scalar.html               |    6 +-
 docs/api/typedoc/classes/webgpucontext.html        |   12 +-
 docs/api/typedoc/enums/argtypecode.html            |   30 +-
 docs/api/typedoc/enums/aynccallbackcode.html       |    4 +-
 docs/api/typedoc/enums/dldatatypecode.html         |    8 +-
 docs/api/typedoc/enums/rpcserverstate.html         |   12 +-
 docs/api/typedoc/enums/sizeof.html                 |   18 +-
 docs/api/typedoc/index.html                        |  114 +-
 docs/api/typedoc/interfaces/disposable.html        |    2 +-
 docs/api/typedoc/interfaces/functioninfo.html      |    6 +-
 docs/api/typedoc/interfaces/libraryprovider.html   |    4 +-
 docs/deploy/android.html                           |    1 +
 docs/deploy/arm_compute_lib.html                   |    3 +-
 docs/deploy/cpp_deploy.html                        |    1 +
 docs/deploy/hls.html                               |    1 +
 docs/deploy/index.html                             |    7 +
 docs/deploy/integrate.html                         |    1 +
 docs/deploy/tensorrt.html                          |    5 +-
 docs/deploy/vitis_ai.html                          | 1069 ++++
 docs/dev/how_to.html                               |    4 +-
 docs/genindex.html                                 |    6 +
 docs/langref/relay_op.html                         |    2 +-
 docs/objects.inv                                   |  Bin 17289 -> 17442 bytes
 docs/searchindex.js                                |    2 +-
 .../auto_scheduler/sg_execution_times.html         |    7 +-
 .../auto_scheduler/tune_conv2d_layer_cuda.html     | 1267 +----
 docs/tutorials/auto_scheduler/tune_matmul_x86.html |   74 +-
 .../auto_scheduler/tune_network_cuda.html          |  684 +++
 docs/tutorials/autotvm/sg_execution_times.html     |   14 +-
 docs/tutorials/autotvm/tune_conv2d_cuda.html       |   47 +-
 docs/tutorials/autotvm/tune_relay_arm.html         |    6 +-
 docs/tutorials/autotvm/tune_relay_cuda.html        |   14 +-
 docs/tutorials/autotvm/tune_relay_mobile_gpu.html  |    6 +-
 docs/tutorials/autotvm/tune_relay_x86.html         |    6 +-
 docs/tutorials/autotvm/tune_simple_template.html   |   22 +-
 docs/tutorials/dev/bring_your_own_datatypes.html   |   14 +-
 docs/tutorials/dev/low_level_custom_pass.html      |   76 +-
 docs/tutorials/dev/sg_execution_times.html         |    8 +-
 docs/tutorials/dev/use_pass_infra.html             | 3121 +-----------
 docs/tutorials/frontend/build_gcn.html             |   34 +-
 .../frontend/deploy_model_on_android.html          |   38 +-
 docs/tutorials/frontend/deploy_model_on_rasp.html  |   32 +-
 .../frontend/deploy_object_detection_pytorch.html  |   36 +-
 docs/tutorials/frontend/deploy_prequantized.html   |   34 +-
 .../frontend/deploy_prequantized_tflite.html       |   36 +-
 docs/tutorials/frontend/deploy_quantized.html      |   36 +-
 docs/tutorials/frontend/deploy_sparse.html         |   32 +-
 docs/tutorials/frontend/deploy_ssd_gluoncv.html    |  156 +-
 docs/tutorials/frontend/from_caffe2.html           |   32 +-
 docs/tutorials/frontend/from_coreml.html           |   32 +-
 docs/tutorials/frontend/from_darknet.html          |   34 +-
 docs/tutorials/frontend/from_keras.html            |   32 +-
 docs/tutorials/frontend/from_mxnet.html            |   36 +-
 docs/tutorials/frontend/from_onnx.html             |   43 +-
 docs/tutorials/frontend/from_pytorch.html          |   43 +-
 docs/tutorials/frontend/from_tensorflow.html       | 1999 +++++++-
 docs/tutorials/frontend/from_tflite.html           |   34 +-
 docs/tutorials/frontend/sg_execution_times.html    |   40 +-
 docs/tutorials/frontend/using_external_lib.html    |   52 +-
 .../get_started/cross_compilation_and_rpc.html     |   14 +-
 docs/tutorials/get_started/relay_quick_start.html  |  123 +-
 docs/tutorials/get_started/sg_execution_times.html |   10 +-
 .../get_started/tensor_expr_get_started.html       |   14 +-
 .../get_started/tvmc_command_line_driver.html      |   18 +-
 docs/tutorials/index.html                          |  290 +-
 docs/tutorials/language/extern_op.html             |   20 +-
 docs/tutorials/language/intrin_math.html           |   20 +-
 docs/tutorials/language/reduction.html             |  185 +-
 docs/tutorials/language/scan.html                  |  119 +-
 docs/tutorials/language/schedule_primitives.html   |  383 +-
 docs/tutorials/language/sg_execution_times.html    |   18 +-
 docs/tutorials/language/tedd.html                  |   20 +-
 docs/tutorials/language/tensorize.html             |  164 +-
 docs/tutorials/language/tuple_inputs.html          |  135 +-
 .../micro_reference_vm.html}                       |  156 +-
 docs/tutorials/micro/micro_tflite.html             |   10 +-
 docs/tutorials/micro/sg_execution_times.html       |    5 +-
 docs/tutorials/optimize/opt_conv_cuda.html         |   12 +-
 docs/tutorials/optimize/opt_conv_tensorcore.html   |   78 +-
 docs/tutorials/optimize/opt_gemm.html              |  261 +-
 .../optimize/opt_matmul_auto_tensorcore.html       |    2 +-
 docs/tutorials/optimize/sg_execution_times.html    |   10 +-
 docs/tutorials/topi/intro_topi.html                |  377 +-
 docs/tutorials/topi/sg_execution_times.html        |    4 +-
 docs/vta/tutorials/autotvm/sg_execution_times.html |    4 +-
 docs/vta/tutorials/autotvm/tune_relay_vta.html     |  192 +-
 .../tutorials/frontend/deploy_classification.html  |   28 +-
 .../vta/tutorials/frontend/sg_execution_times.html |    4 +-
 docs/vta/tutorials/index.html                      |   32 +-
 docs/vta/tutorials/matrix_multiply.html            |  113 +-
 docs/vta/tutorials/optimize/convolution_opt.html   |  119 +-
 .../tutorials/optimize/matrix_multiply_opt.html    |  111 +-
 .../vta/tutorials/optimize/sg_execution_times.html |    6 +-
 docs/vta/tutorials/sg_execution_times.html         |    6 +-
 docs/vta/tutorials/vta_get_started.html            |   76 +-
 1284 files changed, 37870 insertions(+), 37171 deletions(-)

diff --git a/docs/_downloads/08e39628455fe618afd9eb5b958a433e/micro_reference_vm.ipynb b/docs/_downloads/08e39628455fe618afd9eb5b958a433e/micro_reference_vm.ipynb
new file mode 100644
index 0000000..883348e
--- /dev/null
+++ b/docs/_downloads/08e39628455fe618afd9eb5b958a433e/micro_reference_vm.ipynb
@@ -0,0 +1,43 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%matplotlib inline"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "\n# microTVM Reference Virtual Machines\n\n**Author**: `Andrew Reusch <ar...@octoml.ai>`_\n\nThis tutorial explains how to launch microTVM Reference Virtual Machines. You can use these to\ndevelop on real physical hardware without needing to individually install the microTVM\ndependencies. These are also particularly useful when trying to reproduce behavior with\nmicroTVM, such as when filing bug reports.\n\nmicroTVM is the effort to allow TVM to build and execute models on ba [...]
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.6.12"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file
diff --git a/docs/_downloads/0bb862dbb3a4c434477f93fe2c147fbb/tune_simple_template.py b/docs/_downloads/0bb862dbb3a4c434477f93fe2c147fbb/tune_simple_template.py
index b5167b3..4c5c7da 100644
--- a/docs/_downloads/0bb862dbb3a4c434477f93fe2c147fbb/tune_simple_template.py
+++ b/docs/_downloads/0bb862dbb3a4c434477f93fe2c147fbb/tune_simple_template.py
@@ -59,7 +59,7 @@ import sys
 
 import numpy as np
 import tvm
-from tvm import te
+from tvm import te, testing
 
 # the module is called `autotvm`
 from tvm import autotvm
diff --git a/docs/_downloads/0d95a85fc279fdff660608ef305b9107/tune_simple_template.ipynb b/docs/_downloads/0d95a85fc279fdff660608ef305b9107/tune_simple_template.ipynb
index b49ccad..30db8c5 100644
--- a/docs/_downloads/0d95a85fc279fdff660608ef305b9107/tune_simple_template.ipynb
+++ b/docs/_downloads/0d95a85fc279fdff660608ef305b9107/tune_simple_template.ipynb
@@ -33,7 +33,7 @@
       },
       "outputs": [],
       "source": [
-        "import logging\nimport sys\n\nimport numpy as np\nimport tvm\nfrom tvm import te\n\n# the module is called `autotvm`\nfrom tvm import autotvm"
+        "import logging\nimport sys\n\nimport numpy as np\nimport tvm\nfrom tvm import te, testing\n\n# the module is called `autotvm`\nfrom tvm import autotvm"
       ]
     },
     {
diff --git a/docs/_downloads/18fb1ab3ed0a0c9f304520f2beaf4fd6/tvmc_command_line_driver.py b/docs/_downloads/18fb1ab3ed0a0c9f304520f2beaf4fd6/tvmc_command_line_driver.py
index d844de5..bcdf03e 100644
--- a/docs/_downloads/18fb1ab3ed0a0c9f304520f2beaf4fd6/tvmc_command_line_driver.py
+++ b/docs/_downloads/18fb1ab3ed0a0c9f304520f2beaf4fd6/tvmc_command_line_driver.py
@@ -246,10 +246,10 @@ if os.path.exists(output_file):
     with np.load(output_file) as data:
         scores = softmax(data["output_0"])
         scores = np.squeeze(scores)
-        scores = np.argsort(scores)[::-1]
+        ranks = np.argsort(scores)[::-1]
 
-        for i in scores[0:5]:
-            print("class='%s' with probability=%f" % (labels[i], scores[i]))
+        for rank in ranks[0:5]:
+            print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
 
 
 ########################################################################
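
The rename above fixes a real bug: np.argsort returns indices, not probabilities, so
reusing the name "scores" clobbered the softmax output before it could be printed.
A minimal standalone sketch of the corrected top-k logic, using hypothetical values:

    import numpy as np

    scores = np.array([0.05, 0.70, 0.25])  # softmax probabilities (hypothetical)
    ranks = np.argsort(scores)[::-1]       # class indices, best first
    for rank in ranks[0:2]:
        # the index and its probability stay distinct, so both print correctly
        print("class=%d with probability=%f" % (rank, scores[rank]))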
diff --git a/docs/_downloads/2354a24ad8bc07194943c49f2fb48874/tune_conv2d_cuda.ipynb b/docs/_downloads/2354a24ad8bc07194943c49f2fb48874/tune_conv2d_cuda.ipynb
index 8353b85..06ec1b5 100644
--- a/docs/_downloads/2354a24ad8bc07194943c49f2fb48874/tune_conv2d_cuda.ipynb
+++ b/docs/_downloads/2354a24ad8bc07194943c49f2fb48874/tune_conv2d_cuda.ipynb
@@ -33,7 +33,7 @@
       },
       "outputs": [],
       "source": [
-        "import logging\nimport sys\nimport numpy as np\n\nimport tvm\nfrom tvm import te\nfrom tvm import topi\nfrom tvm.topi.testing import conv2d_nchw_python\n\nfrom tvm import autotvm"
+        "import logging\nimport sys\nimport numpy as np\n\nimport tvm\nfrom tvm import te, topi, testing\nfrom tvm.topi.testing import conv2d_nchw_python\n\nfrom tvm import autotvm"
       ]
     },
     {
diff --git a/docs/_downloads/272a5a893d007658546dc0eaf0a7aeed/tune_relay_cuda.py b/docs/_downloads/272a5a893d007658546dc0eaf0a7aeed/tune_relay_cuda.py
index f9b8921..9140713 100644
--- a/docs/_downloads/272a5a893d007658546dc0eaf0a7aeed/tune_relay_cuda.py
+++ b/docs/_downloads/272a5a893d007658546dc0eaf0a7aeed/tune_relay_cuda.py
@@ -64,12 +64,9 @@ import os
 import numpy as np
 
 import tvm
-from tvm import te
-from tvm import autotvm
-from tvm import relay
+from tvm import relay, autotvm
 import tvm.relay.testing
 from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
-from tvm.contrib.utils import tempdir
 import tvm.contrib.graph_runtime as runtime
 
 #################################################################
@@ -102,7 +99,7 @@ def get_network(name, batch_size):
             batch_size=batch_size, version="1.1", dtype=dtype
         )
     elif name == "inception_v3":
-        input_shape = (1, 3, 299, 299)
+        input_shape = (batch_size, 3, 299, 299)
         mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
     elif name == "mxnet":
         # an example for mxnet model
@@ -239,11 +236,6 @@ def tune_and_evaluate(tuning_opt):
         with tvm.transform.PassContext(opt_level=3):
             lib = relay.build_module.build(mod, target=target, params=params)
 
-        # export library
-        tmp = tempdir()
-        filename = "net.tar"
-        lib.export_library(tmp.relpath(filename))
-
         # load parameters
         ctx = tvm.context(str(target), 0)
         module = runtime.GraphModule(lib["default"](ctx))
@@ -323,6 +315,7 @@ def tune_and_evaluate(tuning_opt):
 #################################################################
 # Scale up measurement by using multiple devices
 # ----------------------------------------------
+# .. _tutorials-autotvm-rpc-tracker:
 #
 # If you have multiple devices, you can use all of them for measurement.
 # TVM uses the RPC Tracker to manage distributed devices.
diff --git a/docs/_downloads/2771a7fc8bf8eeb7788823ff349aacc0/tune_network_cuda.py b/docs/_downloads/2771a7fc8bf8eeb7788823ff349aacc0/tune_network_cuda.py
new file mode 100644
index 0000000..9eb5d5c
--- /dev/null
+++ b/docs/_downloads/2771a7fc8bf8eeb7788823ff349aacc0/tune_network_cuda.py
@@ -0,0 +1,302 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Auto-tuning a Neural Network for NVIDIA GPU
+===========================================
+**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_
+
+Auto-tuning for specific devices and workloads is critical for getting the
+best performance. This is a tutorial on how to tune a whole neural
+network for NVIDIA GPU with the auto-scheduler.
+
+To auto-tune a neural network, we partition the network into small subgraphs and 
+tune them independently. Each subgraph is treated as one search task.
+A task scheduler slices the time and dynamically allocates time resources to
+these tasks. The task scheduler predicts the impact of each task on the end-to-end
+execution time and prioritizes the one that can reduce the execution time the most.
+
+For each subgraph, we use the compute declaration in :code:`tvm/python/topi` to
+get the computational DAG in the tensor expression form.
+We then use the auto-scheduler to construct a search space of this DAG and search
+for good schedules (low-level optimizations).
+
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
+manual templates to define the search space, the auto-scheduler does not require any
+schedule templates. In other words, the auto-scheduler only uses the compute declarations
+in :code:`tvm/python/topi` and does not use the existing schedule templates.
+
+Note that this tutorial will not run on Windows or recent versions of macOS. To
+get it to run, you will need to wrap the body of this tutorial in an :code:`if
+__name__ == "__main__":` block.
+"""
+
+import numpy as np
+
+import tvm
+from tvm import relay, auto_scheduler
+import tvm.relay.testing
+from tvm.contrib import graph_runtime
+
+#################################################################
+# Define a Network
+# ----------------
+# First, we need to define the network with relay frontend API.
+# We can load some pre-defined network from :code:`tvm.relay.testing`.
+# We can also load models from MXNet, ONNX, PyTorch, and TensorFlow
+# (see :ref:`front end tutorials<tutorial-frontend>`).
+#
+# Note that although the auto-scheduler can work with any layout,
+# we found that the best performance is typically achieved with the NHWC layout
+# for convolutional neural networks, so we use NHWC layout in this tutorial.
+#
+
+
+def get_network(name, batch_size, layout="NHWC", dtype="float32"):
+    """Get the symbol definition and random weight of a network"""
+
+    # auto-scheduler prefers NHWC layout
+    if layout == "NHWC":
+        image_shape = (224, 224, 3)
+    elif layout == "NCHW":
+        image_shape = (3, 224, 224)
+    else:
+        raise ValueError("Invalid layout: " + layout)
+
+    input_shape = (batch_size,) + image_shape
+    output_shape = (batch_size, 1000)
+
+    if name.startswith("resnet-"):
+        n_layer = int(name.split("-")[1])
+        mod, params = relay.testing.resnet.get_workload(
+            num_layers=n_layer,
+            batch_size=batch_size,
+            layout=layout,
+            dtype=dtype,
+            image_shape=image_shape,
+        )
+    elif name.startswith("resnet3d-"):
+        n_layer = int(name.split("-")[1])
+        mod, params = relay.testing.resnet.get_workload(
+            num_layers=n_layer,
+            batch_size=batch_size,
+            layout=layout,
+            dtype=dtype,
+            image_shape=image_shape,
+        )
+    elif name == "mobilenet":
+        mod, params = relay.testing.mobilenet.get_workload(
+            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
+        )
+    elif name == "squeezenet_v1.1":
+        mod, params = relay.testing.squeezenet.get_workload(
+            version="1.1",
+            batch_size=batch_size,
+            layout=layout,
+            dtype=dtype,
+            image_shape=image_shape,
+        )
+    elif name == "inception_v3":
+        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
+        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
+    elif name == "mxnet":
+        # an example for mxnet model
+        from mxnet.gluon.model_zoo.vision import get_model
+
+        assert layout == "NCHW"
+
+        block = get_model("resnet18_v1", pretrained=True)
+        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
+        net = mod["main"]
+        net = relay.Function(
+            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
+        )
+        mod = tvm.IRModule.from_expr(net)
+
+    return mod, params, input_shape, output_shape
+
+
+# Define the neural network and compilation target
+network = "resnet-18"
+batch_size = 1
+layout = "NHWC"
+target = tvm.target.Target("cuda")
+dtype = "float32"
+log_file = "%s-%s-B%d.json" % (network, layout, batch_size)
+
+#################################################################
+# Extract Search Tasks
+# --------------------
+# Next, we extract the search tasks and their weights from a network.
+# The weight of a task is the number of appearances of the task's subgraph
+# in the whole network.
+# By using the weight, we can approximate the end-to-end latency of the network
+# as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
+# latency of a task and :code:`weight[t]` is the weight of the task.
+# The task scheduler will just optimize this objective.
+
+# Enable auto-scheduler in relay
+auto_scheduler.enable_relay_integration()
+
+# Extract tasks from the network
+print("Extract tasks...")
+mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
+tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
+
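+# As a minimal illustration of this objective (with hypothetical numbers, not
+# measured values): if three tasks have latencies 0.05 ms, 0.10 ms, and 0.02 ms
+# and their subgraphs appear 4, 2, and 1 times respectively, the approximated
+# end-to-end latency is 0.05 * 4 + 0.10 * 2 + 0.02 * 1 = 0.42 ms.
+example_latencies = [0.05, 0.10, 0.02]  # hypothetical per-task latencies (ms)
+example_weights = [4, 2, 1]  # number of appearances of each subgraph
+approx = sum(l * w for l, w in zip(example_latencies, example_weights))
+print("approx. end-to-end latency: %.2f ms" % approx)  # 0.42 ms
+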
+#################################################################
+# Begin Tuning
+# ------------
+# Now, we set some options for tuning and launch the search tasks
+#
+# * :code:`measure_ctx` launches a different process for measurement to
+#   provide isolation. It can protect the master process from GPU crashes
+#   during measurement and avoid other runtime conflicts.
+# * :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
+#   This can warm up the GPU, which is necessary to get accurate measurement results.
+#   Typically, we recommend a value > 300 ms.
+# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
+#   You can set it to a small number (e.g., 200) for a fast demonstrative run.
+#   In practice, we recommend setting it around :code:`1000 * len(tasks)`,
+#   which is typically enough for the search to converge.
+#   For example, there are 21 tasks in resnet-18, so we can set it as 20000.
+#   You can adjust this parameter according to your time budget.
+# * In addition, we use :code:`RecordToFile` to dump measurement records into the log file.
+#   The measurement records can be used to query the history best, resume the search,
+#   and do more analyses later.
+# * see :any:`auto_scheduler.TuningOptions`,
+#   :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
+#
+
+
+def run_tuning():
+    print("Begin tuning...")
+    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=400, timeout=10)
+
+    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+    tune_option = auto_scheduler.TuningOptions(
+        num_measure_trials=200,  # change this to 20000 to achieve the best performance
+        runner=measure_ctx.runner,
+        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+    )
+
+    tuner.tune(tune_option)
+
+
+# We do not run the tuning on our web server since it takes too long.
+# Uncomment the following line to run it by yourself.
+
+# run_tuning()
+
+
+######################################################################
+# .. note:: Explain the printed information during tuning
+#
+#   During the tuning, a lot of information will be printed on the console.
+#   They are used for debugging purposes. The most important info is the output
+#   of the task scheduler. The following table is a sample output.
+#
+#   .. code-block:: c
+#
+#     ----------------------------------------------------------------------
+#     ------------------------------  [ Task Scheduler ]
+#     ----------------------------------------------------------------------
+#     |  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
+#     -------------------------------------------------
+#     |    0 |        0.014 |          72.07 |     64 |
+#     |    1 |        0.185 |        1250.68 |    128 |
+#     |    2 |        0.142 |        1626.36 |    192 |
+#     |    3 |        0.137 |        1689.42 |    128 |
+#     |    4 |        0.097 |        1189.75 |    128 |
+#     |    5 |        0.092 |        2505.25 |    128 |
+#     |    6 |        0.080 |        2893.08 |    128 |
+#     |    7 |        0.119 |        1947.84 |    128 |
+#     |    8 |        0.090 |        1292.62 |     64 |
+#     |    9 |        0.107 |        2172.30 |     64 |
+#     |   10 |        0.095 |        2439.36 |     64 |
+#     |   11 |        0.077 |        3003.22 |     64 |
+#     |   12 |        0.068 |        1695.13 |     64 |
+#     |   13 |        0.058 |        3979.29 |     64 |
+#     |   14 |        0.048 |        4859.95 |    128 |
+#     |   15 |        0.073 |        3151.76 |     64 |
+#     |   16 |        0.056 |        4265.94 |     64 |
+#     |   17 |        0.009 |        2754.90 |     64 |
+#     |   18 |        0.011 |        1156.08 |     64 |
+#     |   19 |        0.013 |         955.80 |     64 |
+#     |   20 |        0.029 |         437.71 |     64 |
+#     -------------------------------------------------
+#     Estimated total latency: 1.649 ms  Trials: 1920  Used time : 3598 s  Next ID: 9
+#
+#   This table lists the latency and (estimated) speed of all tasks.
+#   It also lists the allocation of measurement trials for all tasks.
+#   The last line prints the total weighted latency of these tasks,
+#   which can be a rough estimation of the end-to-end execution time
+#   of the network.
+#   The last line also prints the total number of measurement trials,
+#   total time spent on auto-tuning and the id of the next task to tune.
+#
+#   There will also be some "dmlc::Error"s and CUDA errors, because the
+#   auto-scheduler will try some invalid schedules.
+#   You can safely ignore them if the tuning can continue, because these
+#   errors are isolated from the master process.
+#
+
+######################################################################
+# .. note:: Terminate the tuning early
+#
+#   You can terminate the tuning early by forcibly killing this process.
+#   As long as you get at least one valid schedule for each task in the log file,
+#   you should be able to do the compilation (the section below).
+#
+
+
+#################################################################
+# Compile and Evaluate
+# --------------------
+# After auto-tuning, we can compile the network with the best schedules we found.
+# All measurement records are dumped into the log file during auto-tuning,
+# so we can read the log file and load the best schedules.
+
+# Compile with the history best
+print("Compile...")
+with auto_scheduler.ApplyHistoryBest(log_file):
+    with tvm.transform.PassContext(opt_level=3):
+        lib = relay.build(mod, target=target, params=params)
+
+# Create graph runtime
+ctx = tvm.context(str(target), 0)
+module = graph_runtime.GraphModule(lib["default"](ctx))
+data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
+module.set_input("data", data_tvm)
+
+# Evaluate
+print("Evaluate inference time cost...")
+ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
+prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
+print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))
+
+
+#################################################################
+# Other Tips
+# --------------------
+# 1. During the tuning, the auto-scheduler needs to compile many programs and
+#    extract features from them. This part is CPU-intensive,
+#    so a high-performance CPU with many cores is recommended for faster search.
+# 2. If you have multiple GPUs, you can use all of them for measurements to
+#    parallelize the measurements. Check this :ref:`section <tutorials-autotvm-rpc-tracker>`
+#    to learn how to use the RPC Tracker and RPC Server.
+#    To use the RPC Tracker in auto-scheduler, replace the runner in :code:`TuningOptions`
+#    with :any:`auto_scheduler.RPCRunner`, as sketched below.
+#
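+# For example, a hedged sketch of that replacement (it assumes an RPC tracker
+# running at 127.0.0.1:9190 with devices registered under the hypothetical
+# key "1080ti"; adjust these values to your own setup):
+#
+#   tune_option = auto_scheduler.TuningOptions(
+#       num_measure_trials=200,
+#       runner=auto_scheduler.RPCRunner(
+#           "1080ti",  # hypothetical device key registered with the tracker
+#           host="127.0.0.1",
+#           port=9190,
+#           repeat=1,
+#           min_repeat_ms=400,
+#           timeout=10,
+#       ),
+#       measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+#   )
+#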
diff --git a/docs/_downloads/2c0ed53a9ebd68caf76cd8235fae2711/tune_relay_mobile_gpu.ipynb b/docs/_downloads/2c0ed53a9ebd68caf76cd8235fae2711/tune_relay_mobile_gpu.ipynb
index 52ce8d8..032e56e 100644
--- a/docs/_downloads/2c0ed53a9ebd68caf76cd8235fae2711/tune_relay_mobile_gpu.ipynb
+++ b/docs/_downloads/2c0ed53a9ebd68caf76cd8235fae2711/tune_relay_mobile_gpu.ipynb
@@ -33,7 +33,7 @@
       },
       "outputs": [],
       "source": [
-        "import os\n\nimport numpy as np\n\nimport tvm\nfrom tvm import te\nfrom tvm import autotvm\nfrom tvm import relay\nimport tvm.relay.testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.contrib.utils import tempdir\nimport tvm.contrib.graph_runtime as runtime"
+        "import os\n\nimport numpy as np\n\nimport tvm\nfrom tvm import relay, autotvm\nimport tvm.relay.testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.contrib.utils import tempdir\nimport tvm.contrib.graph_runtime as runtime"
       ]
     },
     {
@@ -51,7 +51,7 @@
       },
       "outputs": [],
       "source": [
-        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
+        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
       ]
     },
     {
diff --git a/docs/_downloads/2c8ef0390ad4c53ca85671fa36c33b26/tune_conv2d_cuda.py b/docs/_downloads/2c8ef0390ad4c53ca85671fa36c33b26/tune_conv2d_cuda.py
index b307077..b662baf 100644
--- a/docs/_downloads/2c8ef0390ad4c53ca85671fa36c33b26/tune_conv2d_cuda.py
+++ b/docs/_downloads/2c8ef0390ad4c53ca85671fa36c33b26/tune_conv2d_cuda.py
@@ -53,8 +53,7 @@ import sys
 import numpy as np
 
 import tvm
-from tvm import te
-from tvm import topi
+from tvm import te, topi, testing
 from tvm.topi.testing import conv2d_nchw_python
 
 from tvm import autotvm
diff --git a/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py b/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py
index 42273bf..a8bb8dd 100644
--- a/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py
+++ b/docs/_downloads/678f3c372a599a18d909aed0fefb30be/tune_conv2d_layer_cuda.py
@@ -22,8 +22,7 @@ Auto-scheduling a convolution layer for GPU
 **Author**: `Lianmin Zheng <https://github.com/merrymercy>`_, \
             `Chengfan Jia <https://github.com/jcf94/>`_
 
-
-Different from the existing :ref:`autotvm <tutorials-autotvm-sec>` which relies on 
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
 manual templates to define the search space, the auto-scheduler does not require any templates.
 Users only need to write the computation declaration without any schedule commands or templates.
 The auto-scheduler can automatically generate a large search space and
@@ -77,11 +76,11 @@ print(task.compute_dag)
 
 ######################################################################
 # Next, we set parameters for the auto-scheduler. These parameters
-# mainly specify how we do the measurement during the search and auto-tuning.
+# mainly specify how we do the measurement during the search.
 #
-# * :code:`measure_ctx` launches a different process for measurement. This
-#   provides an isolation. It can protect the master process from GPU crashes
-#   happended during measurement and avoid other runtime conflicts.
+# * :code:`measure_ctx` launches a different process for measurement to
+#   provide isolation. It can protect the master process from GPU crashes
+#   during measurement and avoid other runtime conflicts.
 # * :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
 #   This can warm up the GPU, which is necessary to get accurate measurement results.
 #   Typically, we recommend a value > 300 ms.
@@ -97,7 +96,7 @@ print(task.compute_dag)
 log_file = "conv2d.json"
 measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
 tune_option = auto_scheduler.TuningOptions(
-    num_measure_trials=10,
+    num_measure_trials=10,  # change this to 1000 to achieve the best performance
     runner=measure_ctx.runner,
     measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
 )
@@ -182,7 +181,6 @@ func = tvm.build(sch, args, target)
 # and resume the status of search policy and cost model with the log file.
 # In the example below we resume the status and do 5 more trials.
 
-
 cost_model = auto_scheduler.XGBModel()
 cost_model.update_from_file(log_file)
 search_policy = auto_scheduler.SketchPolicy(
diff --git a/docs/_downloads/739deb9ab034a5315ce6ba6bf7e5ff44/tune_relay_cuda.ipynb b/docs/_downloads/739deb9ab034a5315ce6ba6bf7e5ff44/tune_relay_cuda.ipynb
index 8fb25bf..02c1e42 100644
--- a/docs/_downloads/739deb9ab034a5315ce6ba6bf7e5ff44/tune_relay_cuda.ipynb
+++ b/docs/_downloads/739deb9ab034a5315ce6ba6bf7e5ff44/tune_relay_cuda.ipynb
@@ -33,7 +33,7 @@
       },
       "outputs": [],
       "source": [
-        "import os\n\nimport numpy as np\n\nimport tvm\nfrom tvm import te\nfrom tvm import autotvm\nfrom tvm import relay\nimport tvm.relay.testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.contrib.utils import tempdir\nimport tvm.contrib.graph_runtime as runtime"
+        "import os\n\nimport numpy as np\n\nimport tvm\nfrom tvm import relay, autotvm\nimport tvm.relay.testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nimport tvm.contrib.graph_runtime as runtime"
       ]
     },
     {
@@ -51,7 +51,7 @@
       },
       "outputs": [],
       "source": [
-        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
+        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
       ]
     },
     {
@@ -112,7 +112,7 @@
       },
       "outputs": [],
       "source": [
-        "def tune_and_evaluate(tuning_opt):\n    # extract workloads from relay program\n    print(\"Extract tasks...\")\n    mod, params, input_shape, out_shape = get_network(network, batch_size=1)\n    tasks = autotvm.task.extract_from_program(\n        mod[\"main\"], target=target, params=params, ops=(relay.op.get(\"nn.conv2d\"),)\n    )\n\n    # run tuning tasks\n    print(\"Tuning...\")\n    tune_tasks(tasks, **tuning_opt)\n\n    # compile kernels with history best records\n    with [...]
+        "def tune_and_evaluate(tuning_opt):\n    # extract workloads from relay program\n    print(\"Extract tasks...\")\n    mod, params, input_shape, out_shape = get_network(network, batch_size=1)\n    tasks = autotvm.task.extract_from_program(\n        mod[\"main\"], target=target, params=params, ops=(relay.op.get(\"nn.conv2d\"),)\n    )\n\n    # run tuning tasks\n    print(\"Tuning...\")\n    tune_tasks(tasks, **tuning_opt)\n\n    # compile kernels with history best records\n    with [...]
       ]
     },
     {
diff --git a/docs/_downloads/77322ea21ff00abad461e549895ef1d8/micro_reference_vm.py b/docs/_downloads/77322ea21ff00abad461e549895ef1d8/micro_reference_vm.py
new file mode 100644
index 0000000..dec1be9
--- /dev/null
+++ b/docs/_downloads/77322ea21ff00abad461e549895ef1d8/micro_reference_vm.py
@@ -0,0 +1,139 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+===================================
+microTVM Reference Virtual Machines
+===================================
+**Author**: `Andrew Reusch <ar...@octoml.ai>`_
+
+This tutorial explains how to launch microTVM Reference Virtual Machines. You can use these to
+develop on real physical hardware without needing to individually install the microTVM
+dependencies. These are also particularly useful when trying to reproduce behavior with
+microTVM, such as when filing bug reports.
+
+microTVM is the effort to allow TVM to build and execute models on bare-metal microcontrollers.
+microTVM aims to be compatible with a wide variety of SoCs and runtime environments (i.e. bare metal,
+RTOS, etc). However, a stable software environment is needed to allow developers to share and
+reproduce bugs and results. The microTVM Reference Virtual Machines are intended to provide that
+environment.
+
+How it works
+============
+
+No Virtual Machines are stored in the TVM repository. Instead, the files stored in
+``apps/microtvm/reference-vm`` describe to the Vagrant_ VM builder tool how to build VMs.
+
+The Reference VMs are split into two parts:
+
+1. A Vagrant Base Box, which contains all of the stable dependencies for that platform. Build
+   scripts are stored in ``apps/microtvm/reference-vm/<platform>/base-box``. TVM committers run
+   these when a platform's "stable" dependencies change, and the generated base boxes are stored in
+   `Vagrant Cloud`_.
+2. A per-workspace VM, which users normally build using the Base Box as a starting point. Build
+   scripts are stored in ``apps/microtvm/reference-vm/<platform>`` (everything except ``base-box``).
+
+.. _Vagrant: https://vagrantup.com
+.. _Vagrant Cloud: https://app.vagrantup.com/tlcpack
+
+Setting up the VM
+=================
+
+Installing prerequisites
+------------------------
+
+A minimal set of prerequisites is needed:
+
+
+1. `Vagrant <https://vagrantup.com>`__
+2. A supported Virtual Machine hypervisor.
+   `VirtualBox <https://www.virtualbox.org>`__ is one suggested free hypervisor, but please note
+   that the `VirtualBox Extension Pack`_ is required for proper USB forwarding.
+
+.. _VirtualBox Extension Pack: https://www.virtualbox.org/wiki/Downloads#VirtualBox6.1.16OracleVMVirtualBoxExtensionPack
+
+First boot
+----------
+
+The first time you use a reference VM, you need to create the box locally and then provision it.
+
+.. code-block:: bash
+
+    ~/.../tvm $ cd apps/microtvm/vm
+    # Replace <provider_name> with the name of the hypervisor you wish to use (i.e. virtualbox).
+    ~/.../tvm/apps/microtvm/vm $ vagrant up --provider=<provider_name>
+
+
+This command will take a couple of minutes to run and will require 4 to 5GB of storage on your
+machine. It does the following:
+
+1. Downloads the `microTVM base box`_ and clones it to form a new VM specific to this TVM directory.
+2. Mounts your TVM directory (and, if using ``git-subtree``, the original ``.git`` repo) into the
+   VM.
+3. Builds TVM and installs a Python virtualenv with the dependencies corresponding to your TVM
+   build.
+
+.. _microTVM base box: https://app.vagrantup.com/tlcpack/boxes/microtvm
+
+
+Next, you need to configure USB passthrough to attach your physical development board to the virtual
+machine (rather than directly to your laptop's host OS).
+
+It's suggested you set up a device filter, rather than doing a one-time forward, because the
+device will often reboot during the programming process, at which point forwarding would need to
+be enabled again; it may not be obvious to the end user when this occurs. Instructions for each
+hypervisor:
+
+ * `VirtualBox <https://www.virtualbox.org/manual/ch03.html#usb-support>`__
+ * `Parallels <https://kb.parallels.com/122993>`__
+ * `VMWare Workstation <https://docs.vmware.com/en/VMware-Workstation-Pro/15.0/com.vmware.ws.using.doc/GUID-E003456F-EB94-4B53-9082-293D9617CB5A.html>`__
+
+Future use
+----------
+
+After the first boot, you'll need to ensure you keep the build, in ``$TVM_HOME/build-microtvm``,
+up-to-date when you modify the C++ runtime or check out a different revision. You can either
+re-provision the machine (``vagrant provision`` in the same directory you ran ``vagrant up`` before)
+or manually rebuild TVM yourself.
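+
+As a sketch, a manual rebuild might look like the following (assuming the
+provisioned ``build-microtvm`` directory uses the standard Makefile-based
+CMake build; adjust to your setup):
+
+.. code-block:: bash
+
+    $ cd $TVM_HOME/build-microtvm
+    $ make -j$(nproc)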
+
+Remember: the TVM ``.so`` built inside the VM is different from the one you may use on your host
+machine. This is why it's built inside the special directory ``build-microtvm``.
+
+Logging in to the VM
+--------------------
+
+The VM should be available to your host only, under the hostname ``microtvm``. You can SSH to the VM
+as follows:
+
+.. code-block:: bash
+
+    $ vagrant ssh
+
+Then ``cd`` to the same path used on your host machine for TVM. For example, on Mac:
+
+.. code-block:: bash
+
+    $ cd /Users/yourusername/path/to/tvm
+
+Running tests
+=============
+
+Once the VM has been provisioned, tests can be executed using ``poetry``:
+
+.. code-block:: bash
+
+    $ poetry run python3 tests/micro/qemu/test_zephyr.py --microtvm-platforms=stm32f746xx
+
+"""
diff --git a/docs/_downloads/85ba00b8ada85b8c5367f37b526a8caa/tune_relay_x86.py b/docs/_downloads/85ba00b8ada85b8c5367f37b526a8caa/tune_relay_x86.py
index b1b7ca2..5b3d032 100644
--- a/docs/_downloads/85ba00b8ada85b8c5367f37b526a8caa/tune_relay_x86.py
+++ b/docs/_downloads/85ba00b8ada85b8c5367f37b526a8caa/tune_relay_x86.py
@@ -32,9 +32,7 @@ import os
 import numpy as np
 
 import tvm
-from tvm import te
-from tvm import autotvm
-from tvm import relay
+from tvm import relay, autotvm
 from tvm.relay import testing
 from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
 from tvm.autotvm.graph_tuner import DPTuner, PBQPTuner
@@ -73,7 +71,7 @@ def get_network(name, batch_size):
             batch_size=batch_size, version="1.1", dtype=dtype
         )
     elif name == "inception_v3":
-        input_shape = (1, 3, 299, 299)
+        input_shape = (batch_size, 3, 299, 299)
         mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
     elif name == "mxnet":
         # an example for mxnet model
diff --git a/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py b/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py
index 0f2ebe0..2bd47de 100644
--- a/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py
+++ b/docs/_downloads/91b0339c8f3cc2594cee580dc450149a/tune_matmul_x86.py
@@ -20,7 +20,7 @@ Auto-scheduling matrix multiplication for CPU
 **Author**: `Lianmin Zheng <https://github.com/merrymercy>`_, \
             `Chengfan Jia <https://github.com/jcf94/>`_
 
-Different from the existing :ref:`autotvm <tutorials-autotvm-sec>` which relies on 
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
 manual templates to define the search space, the auto-scheduler does not require any templates.
 Users only need to write the computation declaration without any schedule commands or templates.
 The auto-scheduler can automatically generate a large search space and
diff --git a/docs/_downloads/b9891d1a23f84eec3271025d99d005f7/tune_relay_x86.ipynb b/docs/_downloads/b9891d1a23f84eec3271025d99d005f7/tune_relay_x86.ipynb
index f072ef2..2a07a04 100644
--- a/docs/_downloads/b9891d1a23f84eec3271025d99d005f7/tune_relay_x86.ipynb
+++ b/docs/_downloads/b9891d1a23f84eec3271025d99d005f7/tune_relay_x86.ipynb
@@ -26,7 +26,7 @@
       },
       "outputs": [],
       "source": [
-        "import os\nimport numpy as np\n\nimport tvm\nfrom tvm import te\nfrom tvm import autotvm\nfrom tvm import relay\nfrom tvm.relay import testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.autotvm.graph_tuner import DPTuner, PBQPTuner\nimport tvm.contrib.graph_runtime as runtime"
+        "import os\nimport numpy as np\n\nimport tvm\nfrom tvm import relay, autotvm\nfrom tvm.relay import testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.autotvm.graph_tuner import DPTuner, PBQPTuner\nimport tvm.contrib.graph_runtime as runtime"
       ]
     },
     {
@@ -44,7 +44,7 @@
       },
       "outputs": [],
       "source": [
-        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
+        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
       ]
     },
     {
diff --git a/docs/_downloads/baf1373314e0e040008107ff2571b4cd/tune_relay_arm.py b/docs/_downloads/baf1373314e0e040008107ff2571b4cd/tune_relay_arm.py
index 7514ee7..c69c7d9 100644
--- a/docs/_downloads/baf1373314e0e040008107ff2571b4cd/tune_relay_arm.py
+++ b/docs/_downloads/baf1373314e0e040008107ff2571b4cd/tune_relay_arm.py
@@ -66,9 +66,7 @@ import os
 
 import numpy as np
 import tvm
-from tvm import te
-from tvm import autotvm
-from tvm import relay
+from tvm import relay, autotvm
 import tvm.relay.testing
 from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
 from tvm.contrib.utils import tempdir
@@ -104,7 +102,7 @@ def get_network(name, batch_size):
             batch_size=batch_size, version="1.1", dtype=dtype
         )
     elif name == "inception_v3":
-        input_shape = (1, 3, 299, 299)
+        input_shape = (batch_size, 3, 299, 299)
         mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
     elif name == "mxnet":
         # an example for mxnet model
diff --git a/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb b/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb
index 6960e9b..03a713a 100644
--- a/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb
+++ b/docs/_downloads/bcb4a24e8acc1ca84214bc8d7fb7954b/tune_conv2d_layer_cuda.ipynb
@@ -15,7 +15,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\n\nAuto-scheduling a convolution layer for GPU\n===========================================\n**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_\n\n\nDifferent from the existing `autotvm <tutorials-autotvm-sec>` which relies on \nmanual templates to define the search space, the auto-scheduler does not require any templates.\nUsers only need to write the computation declaration without any schedule commands or tem [...]
+        "\n\nAuto-scheduling a convolution layer for GPU\n===========================================\n**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_\n\nDifferent from the template-based `autotvm <tutorials-autotvm-sec>` which relies on\nmanual templates to define the search space, the auto-scheduler does not require any templates.\nUsers only need to write the computation declaration without any schedule commands or  [...]
       ]
     },
     {
@@ -69,7 +69,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "Next, we set parameters for the auto-scheduler. These parameters\nmainly specify how we do the measurement during the search and auto-tuning.\n\n* :code:`measure_ctx` launches a different process for measurement. This\n  provides an isolation. It can protect the master process from GPU crashes\n  happended during measurement and avoid other runtime conflicts.\n* :code:`min_repeat_ms` defines the minimum duration of one \"repeat\" in every measurement.\n  This can warmup the GPU, [...]
+        "Next, we set parameters for the auto-scheduler. These parameters\nmainly specify how we do the measurement during the search.\n\n* :code:`measure_ctx` launches a different process for measurement to\n  provide isolation. It can protect the master process from GPU crashes\n  during measurement and avoid other runtime conflicts.\n* :code:`min_repeat_ms` defines the minimum duration of one \"repeat\" in every measurement.\n  This can warmup the GPU, which is necessary to get accura [...]
       ]
     },
     {
@@ -80,7 +80,7 @@
       },
       "outputs": [],
       "source": [
-        "log_file = \"conv2d.json\"\nmeasure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)\ntune_option = auto_scheduler.TuningOptions(\n    num_measure_trials=10,\n    runner=measure_ctx.runner,\n    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n)"
+        "log_file = \"conv2d.json\"\nmeasure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)\ntune_option = auto_scheduler.TuningOptions(\n    num_measure_trials=10,  # change this to 1000 to achieve the best performance\n    runner=measure_ctx.runner,\n    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n)"
       ]
     },
     {
diff --git a/docs/_downloads/dad91669fd0ea707f1374fe331b0dffe/tune_network_cuda.ipynb b/docs/_downloads/dad91669fd0ea707f1374fe331b0dffe/tune_network_cuda.ipynb
new file mode 100644
index 0000000..312cec7
--- /dev/null
+++ b/docs/_downloads/dad91669fd0ea707f1374fe331b0dffe/tune_network_cuda.ipynb
@@ -0,0 +1,147 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%matplotlib inline"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "\nAuto-tuning a Neural Network for NVIDIA GPU\n===========================================\n**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_\n\nAuto-tuning for specific devices and workloads is critical for getting the\nbest performance. This is a tutorial on how to tune a whole neural\nnetwork for NVIDIA GPU with the auto-scheduler.\n\nTo auto-tune a neural network, we partition the network into small subgraphs and \ntune them independently. Each subgraph is treated [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n\nimport tvm\nfrom tvm import relay, auto_scheduler\nimport tvm.relay.testing\nfrom tvm.contrib import graph_runtime"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Define a Network\n----------------\nFirst, we need to define the network with relay frontend API.\nWe can load some pre-defined network from :code:`tvm.relay.testing`.\nWe can also load models from MXNet, ONNX, PyTorch, and TensorFlow\n(see `front end tutorials<tutorial-frontend>`).\n\nNote that although auto-scheduler can work with any layouts,\nwe found that the best performance is typically archived with NHWC layout\nfor convolutional neural networks, so we use NHWC layout in [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "def get_network(name, batch_size, layout=\"NHWC\", dtype=\"float32\"):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n\n    # auto-scheduler prefers NHWC layout\n    if layout == \"NHWC\":\n        image_shape = (224, 224, 3)\n    elif layout == \"NCHW\":\n        image_shape = (3, 224, 224)\n    else:\n        raise ValueError(\"Invalid layout: \" + layout)\n\n    input_shape = (batch_size,) + image_shape\n    output_shape = (batch_size, 1000)\n\n    [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Extract Search Tasks\n--------------------\nNext, we extract the search tasks and their weights from a network.\nThe weight of a task is the number of appearances of the task's subgraph\nin the whole network.\nBy using the weight, we can approximate the end-to-end latency of the network\nas :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the\nlatency of a task and :code:`weight[t]` is the weight of the task.\nThe task scheduler will just optimize this objective.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "# Enable auto-scheduler in relay\nauto_scheduler.enable_relay_integration()\n\n# Extract tasks from the network\nprint(\"Extract tasks...\")\nmod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)\ntasks, task_weights = auto_scheduler.extract_tasks(mod[\"main\"], params, target)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Begin Tuning\n------------\nNow, we set some options for tuning and launch the search tasks\n\n* :code:`measure_ctx` launches a different process for measurement to\n  provide isolation. It can protect the master process from GPU crashes\n  during measurement and avoid other runtime conflicts.\n* :code:`min_repeat_ms` defines the minimum duration of one \"repeat\" in every measurement.\n  This can warmup the GPU, which is necessary to get accurate measurement results.\n  Typical [...]
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "def run_tuning():\n    print(\"Begin tuning...\")\n    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=400, timeout=10)\n\n    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)\n    tune_option = auto_scheduler.TuningOptions(\n        num_measure_trials=200,  # change this to 20000 to achieve the best performance\n        runner=measure_ctx.runner,\n        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n    )\n\n    tuner.tune(tun [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<div class=\"alert alert-info\"><h4>Note</h4><p>Explain the printed information during tuning\n\n  During the tuning, a lot of information will be printed on the console.\n  They are used for debugging purposes. The most important info is the output\n  of the task scheduler. The following table is a sample output.\n\n  .. code-block:: c\n\n    ----------------------------------------------------------------------\n    ------------------------------  [ Task Scheduler ]\n    ----- [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<div class=\"alert alert-info\"><h4>Note</h4><p>Terminate the tuning earlier\n\n  You can terminate the tuning earlier by forcibly killing this process.\n  As long as you get at least one valid schedule for each task in the log file,\n  you should be able to do the compilation (the secion below).</p></div>\n\n\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Compile and Evaluate\n--------------------\nAfter auto-tuning, we can compile the network with the best schedules we found.\nAll measurement records are dumped into the log file during auto-tuning,\nso we can read the log file and load the best schedules.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "# Compile with the history best\nprint(\"Compile...\")\nwith auto_scheduler.ApplyHistoryBest(log_file):\n    with tvm.transform.PassContext(opt_level=3):\n        lib = relay.build(mod, target=target, params=params)\n\n# Create graph runtime\nctx = tvm.context(str(target), 0)\nmodule = graph_runtime.GraphModule(lib[\"default\"](ctx))\ndata_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))\nmodule.set_input(\"data\", data_tvm)\n\n# Evaluate\nprint(\"Evaluate [...]
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Other Tips\n--------------------\n1. During the tuning, the auto-scheduler needs to compile many programs and\n   extract feature from them. This part is CPU-intensive,\n   so a high-performance CPU with many cores is recommended for faster search.\n2. If you have multiple GPUs, you can use all of them for measurements to\n   parallelize the measurements. Check this `section <tutorials-autotvm-rpc-tracker>`\n   to learn how to use the RPC Tracker and RPC Server.\n   To use the R [...]
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.6.12"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file
diff --git a/docs/_downloads/dfa0880631b34bb8814952afdc9031d8/tvmc_command_line_driver.ipynb b/docs/_downloads/dfa0880631b34bb8814952afdc9031d8/tvmc_command_line_driver.ipynb
index 5e0f8e6..d0abb94 100644
--- a/docs/_downloads/dfa0880631b34bb8814952afdc9031d8/tvmc_command_line_driver.ipynb
+++ b/docs/_downloads/dfa0880631b34bb8814952afdc9031d8/tvmc_command_line_driver.ipynb
@@ -100,7 +100,7 @@
       },
       "outputs": [],
       "source": [
-        "import os.path\nimport numpy as np\n\nfrom scipy.special import softmax\n\nfrom tvm.contrib.download import download_testdata\n\n# Download a list of labels\nlabels_url = \"https://s3.amazonaws.com/onnx-model-zoo/synset.txt\"\nlabels_path = download_testdata(labels_url, \"synset.txt\", module=\"data\")\n\nwith open(labels_path, \"r\") as f:\n    labels = [l.rstrip() for l in f]\n\noutput_file = \"predictions.npz\"\n\n# Open the output and read the output tensor\nif os.path.exist [...]
+        "import os.path\nimport numpy as np\n\nfrom scipy.special import softmax\n\nfrom tvm.contrib.download import download_testdata\n\n# Download a list of labels\nlabels_url = \"https://s3.amazonaws.com/onnx-model-zoo/synset.txt\"\nlabels_path = download_testdata(labels_url, \"synset.txt\", module=\"data\")\n\nwith open(labels_path, \"r\") as f:\n    labels = [l.rstrip() for l in f]\n\noutput_file = \"predictions.npz\"\n\n# Open the output and read the output tensor\nif os.path.exist [...]
       ]
     },
     {
diff --git a/docs/_downloads/e41367a7f459e4f4dca82180009c1539/tune_relay_mobile_gpu.py b/docs/_downloads/e41367a7f459e4f4dca82180009c1539/tune_relay_mobile_gpu.py
index b7fbf89..3611696 100644
--- a/docs/_downloads/e41367a7f459e4f4dca82180009c1539/tune_relay_mobile_gpu.py
+++ b/docs/_downloads/e41367a7f459e4f4dca82180009c1539/tune_relay_mobile_gpu.py
@@ -65,9 +65,7 @@ import os
 import numpy as np
 
 import tvm
-from tvm import te
-from tvm import autotvm
-from tvm import relay
+from tvm import relay, autotvm
 import tvm.relay.testing
 from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
 from tvm.contrib.utils import tempdir
@@ -103,7 +101,7 @@ def get_network(name, batch_size):
             batch_size=batch_size, version="1.1", dtype=dtype
         )
     elif name == "inception_v3":
-        input_shape = (1, 3, 299, 299)
+        input_shape = (batch_size, 3, 299, 299)
         mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
     elif name == "mxnet":
         # an example for mxnet model
diff --git a/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb b/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb
index ad43051..4c33490 100644
--- a/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb
+++ b/docs/_downloads/f1a09967bab66114252357e4a9babb45/tune_matmul_x86.ipynb
@@ -15,7 +15,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "\nAuto-scheduling matrix multiplication for CPU\n=============================================\n**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_\n\nDifferent from the existing `autotvm <tutorials-autotvm-sec>` which relies on \nmanual templates to define the search space, the auto-scheduler does not require any templates.\nUsers only need to write the computation declaration without any schedule commands or tem [...]
+        "\nAuto-scheduling matrix multiplication for CPU\n=============================================\n**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_\n\nDifferent from the template-based `autotvm <tutorials-autotvm-sec>` which relies on\nmanual templates to define the search space, the auto-scheduler does not require any templates.\nUsers only need to write the computation declaration without any schedule commands o [...]
       ]
     },
     {
diff --git a/docs/_downloads/f8f7a2adf30f5033603d79cdbacd9235/tune_relay_arm.ipynb b/docs/_downloads/f8f7a2adf30f5033603d79cdbacd9235/tune_relay_arm.ipynb
index b8bf4f4..6a0ef0c 100644
--- a/docs/_downloads/f8f7a2adf30f5033603d79cdbacd9235/tune_relay_arm.ipynb
+++ b/docs/_downloads/f8f7a2adf30f5033603d79cdbacd9235/tune_relay_arm.ipynb
@@ -33,7 +33,7 @@
       },
       "outputs": [],
       "source": [
-        "import os\n\nimport numpy as np\nimport tvm\nfrom tvm import te\nfrom tvm import autotvm\nfrom tvm import relay\nimport tvm.relay.testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.contrib.utils import tempdir\nimport tvm.contrib.graph_runtime as runtime"
+        "import os\n\nimport numpy as np\nimport tvm\nfrom tvm import relay, autotvm\nimport tvm.relay.testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.contrib.utils import tempdir\nimport tvm.contrib.graph_runtime as runtime"
       ]
     },
     {
@@ -51,7 +51,7 @@
       },
       "outputs": [],
       "source": [
-        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
+        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n      [...]
       ]
     },
     {
diff --git a/docs/_images/sphx_glr_micro_reference_vm_thumb.png b/docs/_images/sphx_glr_micro_reference_vm_thumb.png
new file mode 100644
index 0000000..233f8e6
Binary files /dev/null and b/docs/_images/sphx_glr_micro_reference_vm_thumb.png differ
diff --git a/docs/_images/sphx_glr_tune_network_cuda_thumb.png b/docs/_images/sphx_glr_tune_network_cuda_thumb.png
new file mode 100644
index 0000000..233f8e6
Binary files /dev/null and b/docs/_images/sphx_glr_tune_network_cuda_thumb.png differ
diff --git a/docs/_sources/deploy/arm_compute_lib.rst.txt b/docs/_sources/deploy/arm_compute_lib.rst.txt
index 5dd0076..a2eaa5f 100644
--- a/docs/_sources/deploy/arm_compute_lib.rst.txt
+++ b/docs/_sources/deploy/arm_compute_lib.rst.txt
@@ -36,7 +36,7 @@ determine the architecture by looking online.
 
 We recommend two different ways to build and install ACL:
 
-* Use the script located at `docker/install/ubuntu_install_arm_compute_library.sh`. You can use this
+* Use the script located at `docker/install/ubuntu_install_arm_compute_lib.sh`. You can use this
   script for building ACL from source natively or for cross-compiling the library on an x86 machine.
   You may need to change the architecture of the device you wish to compile for by altering the
   `target_arch` variable. Binaries will be built from source and installed to the location denoted by
diff --git a/docs/_sources/deploy/index.rst.txt b/docs/_sources/deploy/index.rst.txt
index 68843ba..e47b0a3 100644
--- a/docs/_sources/deploy/index.rst.txt
+++ b/docs/_sources/deploy/index.rst.txt
@@ -70,3 +70,4 @@ target device without relying on RPC. see the following resources on how to do s
    hls
    arm_compute_lib
    tensorrt
+   vitis_ai
diff --git a/docs/_sources/deploy/vitis_ai.rst.txt b/docs/_sources/deploy/vitis_ai.rst.txt
new file mode 100644
index 0000000..f0bd3ed
--- /dev/null
+++ b/docs/_sources/deploy/vitis_ai.rst.txt
@@ -0,0 +1,652 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+
+Vitis-AI Integration
+====================
+
+`Vitis-AI <https://github.com/Xilinx/Vitis-AI>`__ is Xilinx's
+development stack for hardware-accelerated AI inference on Xilinx
+platforms, including both edge devices and Alveo cards. It consists of
+optimized IP, tools, libraries, models, and example designs. It is
+designed with high efficiency and ease of use in mind, unleashing the
+full potential of AI acceleration on Xilinx FPGA and ACAP.
+
+The current Vitis-AI BYOC (Bring Your Own Codegen) flow inside TVM enables
+acceleration of Neural Network model inference on edge and cloud. The
+identifiers for the supported edge and cloud Deep Learning Processor Units
+(DPUs) are DPUCZDX8G and DPUCADX8G, respectively. These are hardware
+accelerators for convolutional neural networks (CNNs) on top of the
+Xilinx `Zynq Ultrascale+
+MPSoc <https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html>`__
+and
+`Alveo <https://www.xilinx.com/products/boards-and-kits/alveo.html>`__
+(U200/U250) platforms, respectively. For more information about the DPU identifiers,
+see the section on `DPU naming information <#dpu-naming-information>`__.
+
+On this page you will find information on how to
+`build <#build-instructions>`__ TVM with Vitis-AI and on how to `get
+started <#getting-started>`__ with an example.
+
+DPU naming information
+----------------------
+
++---------------------------------+-----------------+-------------------------------------------------------------------------+------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------+
+| DPU                             | Application     | HW Platform                                                             | Quantization Method                                        | Quantization Bitwidth                             | Design Target                                                            |
++=================================+=================+=========================================================================+============================================================+===================================================+==========================================================================+
+| Deep Learning Processing Unit   | C: CNN R: RNN   | AD: Alveo DDR AH: Alveo HBM VD: Versal DDR with AIE & PL ZD: Zynq DDR   | X: DECENT I: Integer threshold F: Float threshold R: RNN   | 4: 4-bit 8: 8-bit 16: 16-bit M: Mixed Precision   | G: General purpose H: High throughput L: Low latency C: Cost optimized   |
++---------------------------------+-----------------+-------------------------------------------------------------------------+------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------+
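+
+For example, reading the columns left to right, the edge identifier ``DPUCZDX8G``
+decodes as a Deep Learning Processing Unit for CNNs (``C``) on a Zynq DDR platform
+(``ZD``), quantized with DECENT (``X``) at 8-bit precision (``8``) and with a
+general-purpose design target (``G``); the cloud identifier ``DPUCADX8G`` differs
+only in targeting an Alveo DDR platform (``AD``).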
+
+Build instructions
+------------------
+
+This section lists the instructions for building TVM with Vitis-AI for
+both `cloud <#cloud-dpucadx8g>`__ and `edge <#edge-dpuczdx8g>`__.
+
+Cloud (DPUCADX8G)
+~~~~~~~~~~~~~~~~~
+
+For Vitis-AI acceleration in the cloud, TVM has to be built on top of the
+Xilinx Alveo platform.
+
+System requirements
+^^^^^^^^^^^^^^^^^^^
+
+The following table lists system requirements for running Docker
+containers as well as for the Alveo cards.
+
++-----------------------------------------------------+----------------------------------------------------------+
+| **Component**                                       | **Requirement**                                          |
++=====================================================+==========================================================+
+| Motherboard                                         | PCI Express 3.0-compliant with one dual-width x16 slot   |
++-----------------------------------------------------+----------------------------------------------------------+
+| System Power Supply                                 | 225W                                                     |
++-----------------------------------------------------+----------------------------------------------------------+
+| Operating System                                    | Ubuntu 16.04, 18.04                                      |
++-----------------------------------------------------+----------------------------------------------------------+
+|                                                     | CentOS 7.4, 7.5                                          |
++-----------------------------------------------------+----------------------------------------------------------+
+|                                                     | RHEL 7.4, 7.5                                            |
++-----------------------------------------------------+----------------------------------------------------------+
+| CPU                                                 | Intel i3/i5/i7/i9/Xeon 64-bit CPU                        |
++-----------------------------------------------------+----------------------------------------------------------+
+| GPU (Optional to accelerate quantization)           | NVIDIA GPU with a compute capability > 3.0               |
++-----------------------------------------------------+----------------------------------------------------------+
+| CUDA Driver (Optional to accelerate quantization)   | nvidia-410                                               |
++-----------------------------------------------------+----------------------------------------------------------+
+| FPGA                                                | Xilinx Alveo U200 or U250                                |
++-----------------------------------------------------+----------------------------------------------------------+
+| Docker Version                                      | 19.03.1                                                  |
++-----------------------------------------------------+----------------------------------------------------------+
+
+Hardware setup and docker build
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. Clone the Vitis AI repository:
+
+   .. code:: bash
+
+      git clone --recurse-submodules https://github.com/Xilinx/Vitis-AI
+   
+2. Install Docker and add your user to the docker group. See Docker's
+   installation instructions at the following links:
+
+   -  https://docs.docker.com/install/linux/docker-ce/ubuntu/
+   -  https://docs.docker.com/install/linux/docker-ce/centos/
+   -  https://docs.docker.com/install/linux/linux-postinstall/
+
+3. Download the latest Vitis AI Docker image with the following command. This container runs on the CPU.
+
+   .. code:: bash
+      
+      docker pull xilinx/vitis-ai:latest
+    
+   To accelerate the quantization, you can optionally use the Vitis-AI GPU docker image. Use the following commands to build the Vitis-AI GPU docker container:
+   
+   .. code:: bash
+
+      cd Vitis-AI/docker
+      ./docker_build_gpu.sh
+
+4. Set up Vitis AI to target Alveo cards. To target Alveo cards with
+   Vitis AI for machine learning workloads, you must install the
+   following software components:
+
+   -  Xilinx Runtime (XRT)
+   -  Alveo Deployment Shells (DSAs)
+   -  Xilinx Resource Manager (XRM) (xbutler)
+   -  Xilinx Overlaybins (Accelerators to Dynamically Load - binary
+      programming files)
+
+   While it is possible to install all of these software components
+   individually, a script has been provided to automatically install
+   them at once. To do so:
+
+   -  Run the following commands:
+
+      .. code:: bash
+      
+         cd Vitis-AI/alveo/packages
+         sudo su
+         ./install.sh
+      
+   -  Power cycle the system.
+   
+5. Clone the TVM and PyXIR repositories
+
+   .. code:: bash
+     
+      git clone --recursive https://github.com/apache/incubator-tvm.git
+      git clone --recursive https://github.com/Xilinx/pyxir.git
+   
+6. Build and start the TVM runtime Vitis-AI Docker container.
+
+   .. code:: bash
+
+      ./incubator-tvm/docker/build.sh demo_vitis_ai bash
+      ./incubator-tvm/docker/bash.sh tvm.demo_vitis_ai
+	  
+      #Setup inside container
+      source /opt/xilinx/xrt/setup.sh
+      . $VAI_ROOT/conda/etc/profile.d/conda.sh
+      conda activate vitis-ai-tensorflow
+      
+7. Install PyXIR
+
+   .. code:: bash
+
+     cd pyxir
+     python3 setup.py install --use_vai_rt_dpucadx8g --user
+
+   
+8. Build TVM inside the container with Vitis-AI
+
+   .. code:: bash
+
+      cd incubator-tvm
+      mkdir build
+      cp cmake/config.cmake build
+      cd build  
+      echo set\(USE_LLVM ON\) >> config.cmake
+      echo set\(USE_VITIS_AI ON\) >> config.cmake
+      cmake ..
+      make -j$(nproc)
+   
+9.  Install TVM
+
+    .. code:: bash
+
+      cd incubator-tvm/python
+      pip3 install -e . --user
+      
+Edge (DPUCZDX8G)
+~~~~~~~~~~~~~~~~
+
+
+For edge deployment we make use of two systems referred to as host and
+edge. The `host <#host-requirements>`__ system is responsible for
+quantization and compilation of the neural network model in a first
+offline step. Afterwards, the model will be deployed on the
+`edge <#edge-requirements>`__ system.
+
+Host requirements
+^^^^^^^^^^^^^^^^^
+
+The following table lists system requirements for running the TVM -
+Vitis-AI docker container.
+
++-----------------------------------------------------+----------------------------------------------+
+| **Component**                                       | **Requirement**                              |
++=====================================================+==============================================+
+| Operating System                                    | Ubuntu 16.04, 18.04                          |
++-----------------------------------------------------+----------------------------------------------+
+|                                                     | CentOS 7.4, 7.5                              |
++-----------------------------------------------------+----------------------------------------------+
+|                                                     | RHEL 7.4, 7.5                                |
++-----------------------------------------------------+----------------------------------------------+
+| CPU                                                 | Intel i3/i5/i7/i9/Xeon 64-bit CPU            |
++-----------------------------------------------------+----------------------------------------------+
+| GPU (Optional to accelerate quantization)           | NVIDIA GPU with a compute capability > 3.0   |
++-----------------------------------------------------+----------------------------------------------+
+| CUDA Driver (Optional to accelerate quantization)   | nvidia-410                                   |
++-----------------------------------------------------+----------------------------------------------+
+| FPGA                                                | Not necessary on host                        |
++-----------------------------------------------------+----------------------------------------------+
+| Docker Version                                      | 19.03.1                                      |
++-----------------------------------------------------+----------------------------------------------+
+
+Host setup and docker build
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. Clone the TVM repo
+
+   .. code:: bash
+
+      git clone --recursive https://github.com/apache/incubator-tvm.git
+2. Build and start the TVM runtime Vitis-AI Docker container.
+
+   .. code:: bash
+
+      cd incubator-tvm
+      ./docker/build.sh demo_vitis_ai bash
+      ./docker/bash.sh tvm.demo_vitis_ai
+   
+      #Setup inside container
+      . $VAI_ROOT/conda/etc/profile.d/conda.sh
+      conda activate vitis-ai-tensorflow
+   
+3. Install PyXIR
+
+   .. code:: bash
+
+      git clone --recursive https://github.com/Xilinx/pyxir.git
+      cd pyxir
+      python3 setup.py install --user
+   
+   
+4. Build TVM inside the container with Vitis-AI.
+
+   .. code:: bash
+
+      cd incubator-tvm 
+      mkdir build
+      cp cmake/config.cmake build
+      cd build
+      echo set\(USE_LLVM ON\) >> config.cmake
+      echo set\(USE_VITIS_AI ON\) >> config.cmake
+      cmake ..
+      make -j$(nproc)
+   
+5. Install TVM
+
+   .. code:: bash
+
+      cd incubator-tvm/python
+      pip3 install -e . --user
+
+Edge requirements
+^^^^^^^^^^^^^^^^^
+
+The DPUCZDX8G can be deployed on the `Zynq Ultrascale+
+MPSoc <https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html>`__
+platform. The following development boards can be used out-of-the-box:
+
++--------------------+----------------------+-----------------------------------------------------------------------+
+| **Target board**   | **TVM identifier**   | **Info**                                                              |
++====================+======================+=======================================================================+
+| Ultra96            | DPUCZDX8G-ultra96    | https://www.xilinx.com/products/boards-and-kits/1-vad4rl.html         |
++--------------------+----------------------+-----------------------------------------------------------------------+
+| ZCU104             | DPUCZDX8G-zcu104     | https://www.xilinx.com/products/boards-and-kits/zcu104.html           |
++--------------------+----------------------+-----------------------------------------------------------------------+
+| ZCU102             | DPUCZDX8G-zcu102     | https://www.xilinx.com/products/boards-and-kits/ek-u1-zcu102-g.html   |
++--------------------+----------------------+-----------------------------------------------------------------------+
+
+Edge hardware setup
+^^^^^^^^^^^^^^^^^^^
+.. note::
+
+  This section provides instructions for setting up with the `Pynq <http://www.pynq.io/>`__ platform,
+  but Petalinux-based flows are also supported.
+
+1. Download the Pynq v2.5 image for your target (use Z1 or Z2 for
+   Ultra96 target depending on board version) Link to image:
+   https://github.com/Xilinx/PYNQ/releases/tag/v2.5
+2. Follow Pynq instructions for setting up the board: `pynq
+   setup <https://pynq.readthedocs.io/en/latest/getting_started.html>`__
+3. After connecting to the board, make sure to run as root. Execute
+   ``su``
+4. Set up DPU on Pynq by following the steps here: `DPU Pynq
+   setup <https://github.com/Xilinx/DPU-PYNQ>`__
+5. Run the following command to download the DPU bitstream:
+
+   .. code:: bash
+
+     python3 -c 'from pynq_dpu import DpuOverlay ; overlay = DpuOverlay("dpu.bit")'
+  
+6. Check whether the DPU kernel is alive:
+
+   .. code:: bash
+
+     dexplorer -w
+
+Edge TVM setup
+^^^^^^^^^^^^^^
+
+.. note::
+
+  When working on Petalinux instead of Pynq, the following steps might take more manual work (e.g.
+  building hdf5 from source). Also, TVM has a scipy dependency which you then might have to build
+  from source or circumvent. We don't depend on scipy in our flow.
+
+Building TVM depends on the Xilinx
+`PyXIR <https://github.com/Xilinx/pyxir>`__ package. PyXIR acts as an
+interface between TVM and Vitis-AI tools.
+
+1. First install the PyXIR h5py and pydot dependencies:
+
+   .. code:: bash
+
+      apt-get install libhdf5-dev
+      pip3 install pydot h5py
+      
+2. Install PyXIR
+
+   .. code:: bash
+
+      git clone --recursive https://github.com/Xilinx/pyxir.git
+      cd pyxir
+      sudo python3 setup.py install --use_vai_rt_dpuczdx8g
+   
+3. Build TVM with Vitis-AI
+
+   .. code:: bash
+
+      git clone --recursive https://github.com/apache/incubator-tvm
+      cd incubator-tvm
+      mkdir build
+      cp cmake/config.cmake build
+      cd build
+      echo set\(USE_VITIS_AI ON\) >> config.cmake
+      cmake ..     
+      make
+   
+4. Install TVM
+
+   .. code:: bash
+
+      cd incubator-tvm/python
+      pip3 install -e . --user
+
+5. Check whether the setup was successful in the Python shell:
+
+   .. code:: bash
+
+      python3 -c 'import pyxir; import tvm'
+
+
+Getting started
+---------------
+
+This section shows how to use TVM with Vitis-AI. For this it's important
+to understand that neural network models are quantized for Vitis-AI
+execution in fixed point arithmetic. The approach we take here is to
+quantize on-the-fly using the first N inputs as explained in the next
+section.
+
+On-the-fly quantization
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Usually, to be able to accelerate inference of Neural Network models
+with Vitis-AI DPU accelerators, those models need to be quantized upfront.
+In the TVM - Vitis-AI flow, we make use of on-the-fly quantization to remove
+this additional preprocessing step. In this flow, one doesn't need to
+quantize their model upfront but can make use of the typical inference
+execution calls (module.run) to quantize the model on-the-fly using the
+first N inputs that are provided (see more information below). This will
+set up and calibrate the Vitis-AI DPU and from that point onwards
+inference will be accelerated for all subsequent inputs. Note that the edge
+flow deviates slightly from the explained flow in that inference won't
+be accelerated after the first N inputs; instead, the model will have been
+quantized and compiled and can be moved to the edge device for
+deployment. Please check out the `edge <#Edge%20usage>`__ usage
+instructions below for more information.
+
+Config/Settings
+~~~~~~~~~~~~~~~
+
+A couple of environment variables can be used to customize the Vitis-AI
+BYOC flow; an example of setting them follows the table.
+
++----------------------------+----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| **Environment Variable**   | **Default if unset**                   | **Explanation**                                                                                                                                                                                                                                                                                                                            |
++============================+========================================+============================================================================================================================================================================================================================================================================================================================================+
+| PX\_QUANT\_SIZE            | 128                                    | The number of inputs that will be used for quantization (necessary for Vitis-AI acceleration)                                                                                                                                                                                                                                              |
++----------------------------+----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| PX\_BUILD\_DIR             | Use the on-the-fly quantization flow   | Loads the quantization and compilation information from the provided build directory and immediately starts Vitis-AI hardware acceleration. This configuration can be used if the model has been executed before using on-the-fly quantization during which the quantization and compilation information was cached in a build directory.  |
++----------------------------+----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
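+
+For example, a shell setup might look like this (a sketch; the build directory
+path is hypothetical):
+
+.. code:: bash
+
+   # Calibrate on 64 inputs instead of the default 128
+   export PX_QUANT_SIZE=64
+
+   # Or, skip on-the-fly quantization and reuse a previously cached build directory
+   export PX_BUILD_DIR=/path/to/cached/build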
+
+Cloud usage
+~~~~~~~~~~~
+
+This section shows how to accelerate a convolutional neural network
+model in TVM with Vitis-AI on the cloud.
+
+To be able to target the Vitis-AI cloud DPUCADX8G target we first have
+to import the target in PyXIR. This PyXIR package is the interface being
+used by TVM to integrate with the Vitis-AI stack. Additionally, import
+the typical TVM and Relay modules and the Vitis-AI contrib module inside
+TVM.
+
+.. code:: python
+
+   import pyxir
+   import pyxir.contrib.target.DPUCADX8G
+
+   import tvm
+   import tvm.relay as relay
+   from tvm.contrib.target import vitis_ai
+   from tvm.contrib import util, graph_runtime
+   from tvm.relay.build_module import bind_params_by_name
+   from tvm.relay.op.contrib.vitis_ai import annotation
+
+After importing a convolutional neural network model using the usual
+Relay APIs, annotate the Relay expression for the given Vitis-AI DPU
+target and partition the graph.
+
+.. code:: python
+
+   mod["main"] = bind_params_by_name(mod["main"], params)
+   mod = annotation(mod, params, target)
+   mod = relay.transform.MergeCompilerRegions()(mod)
+   mod = relay.transform.PartitionGraph()(mod)
+
+Now, we can build the TVM runtime library for executing the model. The
+TVM target is 'llvm' as the operations that can't be handled by the DPU
+are executed on the CPU. The Vitis-AI target is DPUCADX8G as we are
+targeting the cloud DPU and this target is passed as a config to the TVM
+build call.
+
+.. code:: python
+
+   tvm_target = 'llvm'
+   target='DPUCADX8G'
+
+   with tvm.transform.PassContext(opt_level=3, config={'relay.ext.vitis_ai.options.target': target}):
+      lib = relay.build(mod, tvm_target, params=params)
+
+As one more step before we can accelerate a model with Vitis-AI in TVM
+we have to quantize and compile the model for execution on the DPU. We
+make use of on-the-fly quantization for this. Using this method one
+doesn’t need to quantize their model upfront and can make use of the
+typical inference execution calls (module.run) to calibrate the model
+on-the-fly using the first N inputs that are provided. After the first N
+iterations, computations will be accelerated on the DPU. So now we will
+feed N inputs to the TVM runtime module. Note that these first N inputs
+will take a substantial amount of time.
+
+.. code:: python
+
+   module = graph_runtime.GraphModule(lib["default"](tvm.cpu()))
+
+   # First N (default = 128) inputs are used for quantization calibration and will
+   # be executed on the CPU
+   # This config can be changed by setting the 'PX_QUANT_SIZE' (e.g. export PX_QUANT_SIZE=64)
+   for i in range(128):
+      module.set_input(input_name, inputs[i]) 
+      module.run()
+
+Afterwards, inference will be accelerated on the DPU.
+
+.. code:: python
+
+   module.set_input(name, data)
+   module.run()
+
+To save and load the built module, one can use the typical TVM APIs:
+
+.. code:: python
+
+   lib_path = "deploy_lib.so"
+   lib.export_library(lib_path)
+
+Load the module from the compiled files and run inference:
+
+.. code:: python
+
+   # load the module into memory
+   loaded_lib = tvm.runtime.load_module(lib_path)
+
+   module = graph_runtime.GraphModule(loaded_lib["default"](tvm.cpu()))
+   module.set_input(name, data)
+   module.run()
+
+Edge usage
+~~~~~~~~~~
+
+This section shows how to accelerate a convolutional neural network
+model in TVM with Vitis-AI at the edge. The first couple of steps will
+have to be run on the host machine and take care of quantization and
+compilation for deployment at the edge.
+
+Host steps
+^^^^^^^^^^
+
+To be able to target the Vitis-AI edge DPUCZDX8G target we first have
+to import the target in PyXIR. This PyXIR package is the interface being
+used by TVM to integrate with the Vitis-AI stack. Additionally, import
+the typical TVM and Relay modules and the Vitis-AI contrib module inside
+TVM.
+
+.. code:: python
+
+   import pyxir
+   import pyxir.contrib.target.DPUCZDX8G
+
+   import tvm
+   import tvm.relay as relay
+   from tvm.contrib.target import vitis_ai
+   from tvm.contrib import util, graph_runtime
+   from tvm.relay.build_module import bind_params_by_name
+   from tvm.relay.op.contrib.vitis_ai import annotation
+
+After importing a convolutional neural network model using the usual
+Relay APIs, annotate the Relay expression for the given Vitis-AI DPU
+target and partition the graph.
+
+.. code:: python
+
+   mod["main"] = bind_params_by_name(mod["main"], params)
+   mod = annotation(mod, params, target)
+   mod = relay.transform.MergeCompilerRegions()(mod)
+   mod = relay.transform.PartitionGraph()(mod)
+
+Now, we can build the TVM runtime library for executing the model. The
+TVM target is 'llvm' as the operations that can't be handled by the DPU
+are executed on the CPU. At this point that means the CPU on the host machine.
+The Vitis-AI target is DPUCZDX8G-zcu104 as we are targeting the edge DPU
+on the ZCU104 board and this target is passed as a config to the TVM
+build call. Note that different identifiers can be passed for different
+targets, see `edge targets info <#edge-requirements>`__. Additionally, we
+provide the 'export_runtime_module' config that points to a file to which we 
+can export the Vitis-AI runtime module. We have to do this because we will
+first be compiling and quantizing the model on the host machine before building
+the model for edge deployment. As you will see later on, the exported runtime 
+module will be passed to the edge build so that the Vitis-AI runtime module 
+can be included.
+
+.. code:: python
+
+   from tvm.contrib import util
+
+   temp = util.tempdir()
+   
+   tvm_target = 'llvm'
+   target='DPUCZDX8G-zcu104'
+   export_rt_mod_file = temp.relpath("vitis_ai.rtmod")
+  
+   with tvm.transform.PassContext(opt_level=3,
+                                  config={'relay.ext.vitis_ai.options.target': target,
+                                          'relay.ext.vitis_ai.options.export_runtime_module': export_rt_mod_file}):
+      lib = relay.build(mod, tvm_target, params=params)
+      
+We will quantize and compile the model for execution on the DPU using on-the-fly 
+quantization on the host machine. This makes use of TVM inference calls 
+(module.run) to quantize the model on the host with the first N inputs.
+
+.. code:: python
+
+   module = graph_runtime.GraphModule(lib["default"](tvm.cpu()))
+
+   # First N (default = 128) inputs are used for quantization calibration and will
+   # be executed on the CPU
+   # This config can be changed by setting the 'PX_QUANT_SIZE' (e.g. export PX_QUANT_SIZE=64)
+   for i in range(128):
+      module.set_input(input_name, inputs[i]) 
+      module.run()
+      
+Save the TVM lib module so that the Vitis-AI runtime module will also be exported 
+(to the 'export_runtime_module' path we previously passed as a config).
+
+.. code:: python
+
+   # Reuse the tempdir created earlier so that export_rt_mod_file stays valid
+   lib.export_library(temp.relpath("tvm_lib.so"))
+
+After quantizing and compiling the model for Vitis-AI acceleration using the 
+first N inputs we can build the model for execution on the ARM edge device. 
+Here we pass the previously exported Vitis-AI runtime module so it can be included 
+in the TVM build.
+
+.. code:: python
+
+   # Export lib for aarch64 target
+   tvm_target = tvm.target.arm_cpu('ultra96')
+   lib_kwargs = {
+        'fcompile': contrib.cc.create_shared,
+        'cc': "/usr/aarch64-linux-gnu/bin/ld"
+   }
+
+   with tvm.transform.PassContext(opt_level=3,
+                                  config={'relay.ext.vitis_ai.options.load_runtime_module': export_rt_mod_file}):
+        lib_arm = relay.build(mod, tvm_target, params=params)
+
+   lib_arm.export_library('tvm_dpu_arm.so', **lib_kwargs)
+
+Now, move the TVM build files (tvm\_dpu\_arm.json, tvm\_dpu\_arm.so,
+tvm\_dpu\_arm.params) to the edge device. For information on setting
+up the edge device check out the `edge setup <#edge-dpuczdx8g>`__
+section.
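+
+As a sketch, copying the build artifacts over the network might look like this
+(the board address and destination directory are hypothetical):
+
+.. code:: bash
+
+   scp tvm_dpu_arm.json tvm_dpu_arm.so tvm_dpu_arm.params root@192.168.0.10:/home/root/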
+
+Edge steps
+^^^^^^^^^^
+
+After setting up TVM with Vitis-AI on the edge device, you can now load 
+the TVM runtime module into memory and feed inputs for inference.
+
+.. code:: python
+
+   ctx = tvm.cpu()
+
+   # load the module into memory
+   lib = tvm.runtime.load_module("tvm_dpu_arm.so")
+
+   module = graph_runtime.GraphModule(lib["default"](ctx))
+   module.set_input(name, data)
+   module.run()
diff --git a/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt b/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt
index 31b5288..eefb895 100644
--- a/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/auto_scheduler/sg_execution_times.rst.txt
@@ -5,7 +5,8 @@
 
 Computation times
 =================
-**04:25.488** total execution time for **tutorials_auto_scheduler** files:
+**05:05.432** total execution time for **tutorials_auto_scheduler** files:
 
-- **02:40.668**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
-- **01:44.820**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_matmul_x86.py` (``tune_matmul_x86.py``)
+- **02:46.397**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_conv2d_layer_cuda.py` (``tune_conv2d_layer_cuda.py``)
+- **01:54.000**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_matmul_x86.py` (``tune_matmul_x86.py``)
+- **00:25.034**: :ref:`sphx_glr_tutorials_auto_scheduler_tune_network_cuda.py` (``tune_network_cuda.py``)
diff --git a/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt b/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt
index 692fbba..37ec121 100644
--- a/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt
+++ b/docs/_sources/tutorials/auto_scheduler/tune_conv2d_layer_cuda.rst.txt
@@ -13,8 +13,7 @@ Auto-scheduling a convolution layer for GPU
 ===========================================
 **Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_
 
-
-Different from the existing :ref:`autotvm <tutorials-autotvm-sec>` which relies on 
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
 manual templates to define the search space, the auto-scheduler does not require any templates.
 Users only need to write the computation declaration without any schedule commands or templates.
 The auto-scheduler can automatically generate a large search space and
@@ -109,11 +108,11 @@ We then create a search task for the last convolution layer in the resnet.
 
 
 Next, we set parameters for the auto-scheduler. These parameters
-mainly specify how we do the measurement during the search and auto-tuning.
+mainly specify how we do the measurement during the search.
 
-* :code:`measure_ctx` launches a different process for measurement. This
-  provides an isolation. It can protect the master process from GPU crashes
-  happended during measurement and avoid other runtime conflicts.
+* :code:`measure_ctx` launches a different process for measurement to
+  provide isolation. It can protect the master process from GPU crashes
+  during measurement and avoid other runtime conflicts.
 * :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
   This can warm up the GPU, which is necessary to get accurate measurement results.
   Typically, we recommend a value > 300 ms.
@@ -133,7 +132,7 @@ mainly specify how we do the measurement during the search and auto-tuning.
     log_file = "conv2d.json"
     measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
     tune_option = auto_scheduler.TuningOptions(
-        num_measure_trials=10,
+        num_measure_trials=10,  # change this to 1000 to achieve the best performance
         runner=measure_ctx.runner,
         measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
     )
@@ -201,1168 +200,104 @@ cooperative fetching, unrolling and operator fusion.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
+      buffers = {bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
+                 compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
                  kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),
-                 bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
                  data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
       buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
-      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 64;
+      attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 224;
       attr [compute_3: Pointer(float32)] "storage_scope" = "local";
-      allocate(compute_3, float32, [7]);
+      allocate(compute_3, float32, [2]);
       attr [pad_temp.shared: Pointer(float32)] "storage_scope" = "shared";
-      allocate(pad_temp.shared, float32, [1296]);
+      allocate(pad_temp.shared, float32, [72]);
       attr [kernel.shared: Pointer(float32)] "storage_scope" = "shared";
-      allocate(kernel.shared, float32, [1152]);
+      allocate(kernel.shared, float32, [384]);
       attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
         compute_3[0] = 0f32
         compute_3[1] = 0f32
-        compute_3[2] = 0f32
-        compute_3[3] = 0f32
-        compute_3[4] = 0f32
-        compute_3[5] = 0f32
-        compute_3[6] = 0f32
-        for (rc.outer.outer: int32, 0, 32) {
-          attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
-            pad_temp.shared[(threadIdx.x_1*4)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1*4), 81)) && (floormod((threadIdx.x_1*4), 81) < 72)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv((threadIdx.x_1*4), 81)*49)) + (floordiv(floormod((threadIdx.x_1*4), 81), 9)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 1)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 1), 81)) && (floormod(((threadIdx.x_1*4) + 1), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 1), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 1), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 2)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 2), 81)) && (floormod(((threadIdx.x_1*4) + 2), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 2), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 2), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 3)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 3), 81)) && (floormod(((threadIdx.x_1*4) + 3), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 3), 9))) && (floormod(((threadIdx.x_1*4) + 3), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 3), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 3), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 3), 9)) - 8)], 0f32, dtype=float32)
-          }
-          attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
-            pad_temp.shared[((threadIdx.x_1*4) + 224)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 62), 81)) && (floormod(((threadIdx.x_1*4) + 62), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 8), 9))) && (floormod(((threadIdx.x_1*4) + 8), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 224), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 62), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 8), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 225)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 63), 81)) && (floormod(((threadIdx.x_1*4) + 63), 81) < 72)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 225), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 63), 81), 9)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 226)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 64), 81)) && (floormod(((threadIdx.x_1*4) + 64), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 226), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 64), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 227)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 65), 81)) && (floormod(((threadIdx.x_1*4) + 65), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 2), 9))) && (floormod(((threadIdx.x_1*4) + 2), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 227), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 65), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 2), 9)) - 8)], 0f32, dtype=float32)
-          }
-          attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
-            pad_temp.shared[((threadIdx.x_1*4) + 448)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 43), 81)) && (floormod(((threadIdx.x_1*4) + 43), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 7), 9))) && (floormod(((threadIdx.x_1*4) + 7), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 448), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 43), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 7), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 449)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 44), 81)) && (floormod(((threadIdx.x_1*4) + 44), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 8), 9))) && (floormod(((threadIdx.x_1*4) + 8), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 449), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 44), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 8), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 450)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 45), 81)) && (floormod(((threadIdx.x_1*4) + 45), 81) < 72)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 450), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 45), 81), 9)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 451)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 46), 81)) && (floormod(((threadIdx.x_1*4) + 46), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 1), 9))) && (floormod(((threadIdx.x_1*4) + 1), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 451), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 46), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 1), 9)) - 8)], 0f32, dtype=float32)
-          }
-          attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
-            pad_temp.shared[((threadIdx.x_1*4) + 672)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 24), 81)) && (floormod(((threadIdx.x_1*4) + 24), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 6), 9))) && (floormod(((threadIdx.x_1*4) + 6), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 672), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 24), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 6), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 673)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 25), 81)) && (floormod(((threadIdx.x_1*4) + 25), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 7), 9))) && (floormod(((threadIdx.x_1*4) + 7), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 673), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 25), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 7), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 674)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 26), 81)) && (floormod(((threadIdx.x_1*4) + 26), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 8), 9))) && (floormod(((threadIdx.x_1*4) + 8), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 674), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 26), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 8), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 675)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 27), 81)) && (floormod(((threadIdx.x_1*4) + 27), 81) < 72)) && (1 <= floormod((threadIdx.x_1*4), 9))) && (floormod((threadIdx.x_1*4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 675), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 27), 81), 9)*7)) + floormod((threadIdx.x_1*4), 9)) - 8)], 0f32, dtype=float32)
-          }
-          attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
-            pad_temp.shared[((threadIdx.x_1*4) + 896)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 5), 81)) && (floormod(((threadIdx.x_1*4) + 5), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 5), 9))) && (floormod(((threadIdx.x_1*4) + 5), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 896), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 5), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 5), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 897)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 6), 81)) && (floormod(((threadIdx.x_1*4) + 6), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 6), 9))) && (floormod(((threadIdx.x_1*4) + 6), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 897), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 6), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 6), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 898)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 7), 81)) && (floormod(((threadIdx.x_1*4) + 7), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 7), 9))) && (floormod(((threadIdx.x_1*4) + 7), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 898), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 7), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 7), 9)) - 8)], 0f32, dtype=float32)
-            pad_temp.shared[((threadIdx.x_1*4) + 899)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 8), 81)) && (floormod(((threadIdx.x_1*4) + 8), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 8), 9))) && (floormod(((threadIdx.x_1*4) + 8), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 899), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 8), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 8), 9)) - 8)], 0f32, dtype=float32)
-          }
-          attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56 {
-            if @tir.likely((threadIdx.x_1 < 44), dtype=bool) {
-              pad_temp.shared[((threadIdx.x_1*4) + 1120)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 67), 81)) && (floormod(((threadIdx.x_1*4) + 67), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 4), 9))) && (floormod(((threadIdx.x_1*4) + 4), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 1120), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 67), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 4), 9)) - 8)], 0f32, dtype=float32)
-            }
-            if @tir.likely(((threadIdx.x_1*4) < 175), dtype=bool) {
-              if @tir.likely((threadIdx.x_1 < 44), dtype=bool) {
-                pad_temp.shared[((threadIdx.x_1*4) + 1121)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 68), 81)) && (floormod(((threadIdx.x_1*4) + 68), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 5), 9))) && (floormod(((threadIdx.x_1*4) + 5), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 1121), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 68), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 5), 9)) - 8)], 0f32, dtype=float32)
-              }
+        for (rc.outer.outer: int32, 0, 64) {
+          for (rx.outer.outer: int32, 0, 3) {
+            attr [IterVar(threadIdx.x_1: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            pad_temp.shared[threadIdx.x_1] = @tir.if_then_else(((((1 <= floormod(threadIdx.x_1, 9)) && (floormod(threadIdx.x_1, 9) < 8)) && (1 <= (rx.outer.outer + floormod(blockIdx.x, 7)))) && ((rx.outer.outer + floormod(blockIdx.x, 7)) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv(threadIdx.x_1, 9)*49)) + (floormod(threadIdx.x_1, 9)*7)) + rx.outer.outer) + floormod(blockIdx.x, 7)) - 8)], 0f32, dtype=float32)
+            attr [IterVar(threadIdx.x_1, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            if @tir.likely((threadIdx.x_1 < 16), dtype=bool) {
+              pad_temp.shared[(threadIdx.x_1 + 56)] = @tir.if_then_else(((((1 <= floormod((threadIdx.x_1 + 2), 9)) && (floormod((threadIdx.x_1 + 2), 9) < 8)) && (1 <= (rx.outer.outer + floormod(blockIdx.x, 7)))) && ((rx.outer.outer + floormod(blockIdx.x, 7)) < 8)), (float32*)data_2[((((((rc.outer.outer*392) + (floordiv((threadIdx.x_1 + 56), 9)*49)) + (floormod((threadIdx.x_1 + 2), 9)*7)) + rx.outer.outer) + floormod(blockIdx.x, 7)) - 8)], 0f32, dtype=float32)
             }
-            if @tir.likely(((threadIdx.x_1*4) < 174), dtype=bool) {
-              if @tir.likely((threadIdx.x_1 < 44), dtype=bool) {
-                pad_temp.shared[((threadIdx.x_1*4) + 1122)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 69), 81)) && (floormod(((threadIdx.x_1*4) + 69), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 6), 9))) && (floormod(((threadIdx.x_1*4) + 6), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 1122), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 69), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 6), 9)) - 8)], 0f32, dtype=float32)
-              }
+            attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            kernel.shared[threadIdx.x_2] = (float32*)kernel_2[(((((floordiv(blockIdx.x, 7)*73728) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floormod(threadIdx.x_2, 24)*3)) + rx.outer.outer)]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            kernel.shared[(threadIdx.x_2 + 56)] = (float32*)kernel_2[(((((floordiv(blockIdx.x, 7)*73728) + (floordiv((threadIdx.x_2 + 56), 24)*4608)) + (rc.outer.outer*72)) + (floormod((threadIdx.x_2 + 8), 24)*3)) + rx.outer.outer)]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            kernel.shared[(threadIdx.x_2 + 112)] = (float32*)kernel_2[(((((floordiv(blockIdx.x, 7)*73728) + (floordiv((threadIdx.x_2 + 112), 24)*4608)) + (rc.outer.outer*72)) + (floormod((threadIdx.x_2 + 16), 24)*3)) + rx.outer.outer)]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            kernel.shared[(threadIdx.x_2 + 168)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*73728) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floormod(threadIdx.x_2, 24)*3)) + rx.outer.outer) + 32256)]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            kernel.shared[(threadIdx.x_2 + 224)] = (float32*)kernel_2[(((((floordiv(blockIdx.x, 7)*73728) + (floordiv((threadIdx.x_2 + 224), 24)*4608)) + (rc.outer.outer*72)) + (floormod((threadIdx.x_2 + 8), 24)*3)) + rx.outer.outer)]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            kernel.shared[(threadIdx.x_2 + 280)] = (float32*)kernel_2[(((((floordiv(blockIdx.x, 7)*73728) + (floordiv((threadIdx.x_2 + 280), 24)*4608)) + (rc.outer.outer*72)) + (floormod((threadIdx.x_2 + 16), 24)*3)) + rx.outer.outer)]
+            attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
+            if @tir.likely((threadIdx.x_2 < 48), dtype=bool) {
+              kernel.shared[(threadIdx.x_2 + 336)] = (float32*)kernel_2[((((((floordiv(blockIdx.x, 7)*73728) + (floordiv(threadIdx.x_2, 24)*4608)) + (rc.outer.outer*72)) + (floormod(threadIdx.x_2, 24)*3)) + rx.outer.outer) + 64512)]
             }
-            if @tir.likely(((threadIdx.x_1*4) < 173), dtype=bool) {
-              if @tir.likely((threadIdx.x_1 < 44), dtype=bool) {
-                pad_temp.shared[((threadIdx.x_1*4) + 1123)] = @tir.if_then_else(((((9 <= floormod(((threadIdx.x_1*4) + 70), 81)) && (floormod(((threadIdx.x_1*4) + 70), 81) < 72)) && (1 <= floormod(((threadIdx.x_1*4) + 7), 9))) && (floormod(((threadIdx.x_1*4) + 7), 9) < 8)), (float32*)data_2[(((((rc.outer.outer*784) + (floordiv(((threadIdx.x_1*4) + 1123), 81)*49)) + (floordiv(floormod(((threadIdx.x_1*4) + 70), 81), 9)*7)) + floormod(((threadIdx.x_1*4) + 7), 9)) - 8)], 0f32, dtype=float32)
-              }
-            }
-          }
-          attr [IterVar(threadIdx.x_2: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[threadIdx.x_2] = (float32*)kernel_2[(((blockIdx.x*36864) + (rc.outer.outer*144)) + threadIdx.x_2)]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 56)] = (float32*)kernel_2[(((blockIdx.x*36864) + (rc.outer.outer*144)) + (threadIdx.x_2 + 56))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 112)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 112), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 112), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 168)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 168), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 24), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 224)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 224), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 80), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 280)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 280), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 136), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 336)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 336), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 48), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 392)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 392), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 104), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 448)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 448), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 16), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 504)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 504), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 72), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 560)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 560), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 128), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 616)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 616), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 40), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 672)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 672), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 96), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 728)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 728), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 8), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 784)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 784), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 64), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 840)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 840), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 120), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 896)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 896), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 32), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 952)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 952), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 88), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 1008)] = (float32*)kernel_2[((((blockIdx.x*36864) + (rc.outer.outer*144)) + threadIdx.x_2) + 32256)]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          kernel.shared[(threadIdx.x_2 + 1064)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1064), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 56), 144))]
-          attr [IterVar(threadIdx.x_2, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 56;
-          if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {
-            kernel.shared[(threadIdx.x_2 + 1120)] = (float32*)kernel_2[((((blockIdx.x*36864) + (floordiv((threadIdx.x_2 + 1120), 144)*4608)) + (rc.outer.outer*144)) + floormod((threadIdx.x_2 + 112), 144))]
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[floormod(threadIdx.x, 7)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*48)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 3)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[floormod(threadIdx.x, 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 24)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 27)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 1)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 4)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 25)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 28)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 2)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 5)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 26)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 29)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 6)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 27)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 9)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 30)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 27)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 33)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 7)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 28)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 10)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 31)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 28)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 34)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 8)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 29)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 11)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 32)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 29)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 35)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 36)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 12)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 45)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 15)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 36)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 36)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 45)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 39)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 37)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 13)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 46)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 16)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 37)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 37)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 46)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 40)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 38)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 14)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 47)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 17)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 38)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 38)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 47)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 41)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 54)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 18)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 63)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 21)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 54)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 42)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 63)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 45)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 55)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 19)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 64)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 22)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 55)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 43)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 64)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 46)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 56)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 20)]))
+            compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 65)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 23)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 56)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 44)]))
+            compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7) + 65)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*48) + 47)]))
           }
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(floormod(threadIdx.x, 7)*9)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[(floordiv(threadIdx.x, 7)*144)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 9)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 3)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 18)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 6)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 1)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 10)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 4)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 19)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 7)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 2)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 3)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 4)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 5)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 6)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 7)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 8)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 2)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 11)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 12)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 13)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 14)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 15)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 16)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 17)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 5)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 20)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 21)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 22)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 23)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 24)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 25)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 26)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 8)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 81)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 9)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 90)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 12)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 99)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 15)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 82)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 10)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 91)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 13)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 100)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 16)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 83)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 84)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 85)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 86)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 87)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 88)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 89)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 11)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 92)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 93)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 94)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 95)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 96)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 97)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 98)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 14)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 101)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 102)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 103)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 104)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 105)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 106)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 107)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 17)]))
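-          // The fully unrolled statements above and below all instantiate one multiply-accumulate
-          // pattern: for input channel rc, kernel tap (kh, kw), and output column j in 0..6,
-          //   compute_3[j] += pad_temp.shared[floormod(threadIdx.x, 7)*9 + rc*81 + kh*9 + kw + j]
-          //                 * kernel.shared[floordiv(threadIdx.x, 7)*144 + rc*9 + kh*3 + kw]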
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 162)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 163)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 164)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 165)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 166)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 167)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 168)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 18)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 171)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 172)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 173)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 174)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 175)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 176)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 177)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 21)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 180)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 181)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 182)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 183)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 184)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 185)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 186)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 24)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 163)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 164)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 165)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 166)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 167)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 168)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 169)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 19)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 172)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 173)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 174)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 175)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 176)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 177)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 178)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 22)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 181)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 182)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 183)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 184)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 185)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 186)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 187)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 25)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 164)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 165)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 166)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 167)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 168)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 169)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 170)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 20)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 173)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 174)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 175)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 176)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 177)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 178)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 179)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 23)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 182)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 183)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 184)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 185)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 186)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 187)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 188)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 26)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 243)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 244)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 245)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 246)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 247)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 248)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 249)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 27)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 252)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 253)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 254)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 255)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 256)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 257)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 258)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 30)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 261)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 262)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 263)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 264)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 265)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 266)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 267)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 33)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 244)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 245)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 246)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 247)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 248)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 249)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 250)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 28)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 253)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 254)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 255)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 256)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 257)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 258)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 259)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 31)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 262)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 263)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 264)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 265)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 266)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 267)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 268)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 34)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 245)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 246)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 247)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 248)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 249)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 250)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 251)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 29)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 254)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 255)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 256)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 257)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 258)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 259)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 260)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 32)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 263)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 264)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 265)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 266)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 267)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 268)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 269)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 35)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 324)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 325)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 326)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 327)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 328)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 329)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 330)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 36)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 333)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 334)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 335)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 336)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 337)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 338)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 339)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 39)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 342)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 343)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 344)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 345)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 346)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 347)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 348)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 42)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 325)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 326)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 327)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 328)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 329)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 330)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 331)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 37)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 334)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 335)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 336)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 337)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 338)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 339)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 340)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 40)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 343)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 344)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 345)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 346)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 347)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 348)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 349)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 43)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 326)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 327)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 328)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 329)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 330)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 331)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 332)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 38)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 335)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 336)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 337)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 338)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 339)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 340)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 341)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 41)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 344)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 345)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 346)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 347)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 348)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 349)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 350)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 44)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 405)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 406)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 407)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 408)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 409)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 410)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 411)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 45)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 414)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 415)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 416)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 417)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 418)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 419)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 420)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 48)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 423)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 424)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 425)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 426)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 427)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 428)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 429)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 51)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 406)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 407)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 408)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 409)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 410)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 411)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 412)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 46)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 415)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 416)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 417)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 418)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 419)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 420)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 421)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 49)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 424)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 425)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 426)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 427)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 428)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 429)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 430)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 52)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 407)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 408)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 409)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 410)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 411)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 412)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 413)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 47)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 416)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 417)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 418)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 419)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 420)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 421)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 422)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 50)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 425)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 426)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 427)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 428)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 429)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 430)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 431)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 53)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 486)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 487)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 488)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 489)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 490)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 491)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 492)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 54)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 495)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 496)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 497)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 498)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 499)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 500)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 501)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 57)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 504)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 505)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 506)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 507)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 508)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 509)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 510)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 60)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 487)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 488)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 489)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 490)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 491)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 492)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 493)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 55)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 496)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 497)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 498)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 499)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 500)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 501)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 502)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 58)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 505)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 506)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 507)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 508)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 509)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 510)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 511)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 61)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 488)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 489)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 490)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 491)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 492)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 493)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 494)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 56)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 497)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 498)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 499)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 500)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 501)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 502)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 503)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 59)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 506)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 507)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 508)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 509)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 510)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 511)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 512)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 62)]))
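           // (end of input channel 6's taps in this unrolled body)
           // Every statement above and below has the same shape: for kernel tap
           // k = rc*9 + ry*3 + rx and output column j in 0..6,
           //   compute_3[j] += pad_temp.shared[floormod(threadIdx.x, 7)*9
           //                                   + rc*81 + ry*9 + rx + j]
           //                   * kernel.shared[floordiv(threadIdx.x, 7)*144 + k]
           // floormod(threadIdx.x, 7) picks the output row and floordiv(threadIdx.x, 7)
           // the output channel; each thread accumulates 7 adjacent output columns,
           // visiting the 3x3 taps with rx as the outer and ry as the inner loop.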
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 567)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 568)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 569)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 570)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 571)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 572)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 573)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 63)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 576)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 577)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 578)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 579)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 580)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 581)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 582)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 66)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 585)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 586)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 587)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 588)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 589)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 590)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 591)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 69)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 568)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 569)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 570)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 571)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 572)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 573)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 574)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 64)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 577)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 578)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 579)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 580)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 581)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 582)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 583)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 67)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 586)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 587)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 588)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 589)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 590)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 591)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 592)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 70)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 569)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 570)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 571)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 572)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 573)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 574)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 575)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 65)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 578)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 579)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 580)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 581)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 582)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 583)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 584)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 68)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 587)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 588)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 589)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 590)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 591)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 592)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 593)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 71)]))
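           // input channel 7 (taps 63-71) done; channel 8 (taps 72-80) follows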
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 648)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 649)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 650)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 651)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 652)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 653)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 654)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 72)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 657)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 658)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 659)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 660)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 661)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 662)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 663)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 75)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 666)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 667)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 668)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 669)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 670)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 671)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 672)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 78)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 649)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 650)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 651)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 652)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 653)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 654)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 655)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 73)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 658)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 659)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 660)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 661)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 662)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 663)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 664)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 76)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 667)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 668)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 669)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 670)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 671)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 672)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 673)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 79)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 650)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 651)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 652)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 653)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 654)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 655)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 656)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 74)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 659)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 660)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 661)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 662)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 663)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 664)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 665)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 77)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 668)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 669)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 670)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 671)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 672)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 673)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 674)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 80)]))
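           // input channel 8 (taps 72-80) done; channel 9 (taps 81-89) follows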
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 729)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 730)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 731)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 732)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 733)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 734)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 735)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 81)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 738)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 739)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 740)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 741)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 742)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 743)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 744)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 84)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 747)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 748)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 749)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 750)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 751)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 752)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 753)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 87)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 730)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 731)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 732)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 733)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 734)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 735)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 736)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 82)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 739)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 740)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 741)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 742)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 743)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 744)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 745)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 85)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 748)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 749)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 750)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 751)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 752)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 753)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 754)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 88)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 731)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 732)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 733)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 734)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 735)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 736)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 737)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 83)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 740)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 741)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 742)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 743)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 744)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 745)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 746)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 86)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 749)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 750)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 751)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 752)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 753)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 754)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 755)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 89)]))
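           // input channel 9 (taps 81-89) done; channel 10 (taps 90-98) follows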
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 810)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 811)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 812)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 813)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 814)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 815)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 816)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 90)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 819)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 820)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 821)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 822)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 823)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 824)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 825)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 93)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 828)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 829)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 830)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 831)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 832)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 833)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 834)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 96)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 811)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 812)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 813)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 814)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 815)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 816)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 817)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 91)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 820)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 821)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 822)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 823)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 824)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 825)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 826)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 94)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 829)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 830)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 831)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 832)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 833)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 834)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 835)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 97)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 812)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 813)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 814)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 815)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 816)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 817)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 818)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 92)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 821)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 822)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 823)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 824)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 825)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 826)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 827)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 95)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 830)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 831)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 832)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 833)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 834)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 835)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 836)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 98)]))
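           // input channel 10 (taps 90-98) done; channel 11 (taps 99-107) follows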
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 891)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 892)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 893)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 894)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 895)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 896)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 897)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 99)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 900)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 901)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 902)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 903)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 904)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 905)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 906)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 102)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 909)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 910)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 911)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 912)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 913)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 914)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 915)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 105)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 892)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 893)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 894)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 895)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 896)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 897)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 898)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 100)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 901)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 902)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 903)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 904)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 905)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 906)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 907)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 103)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 910)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 911)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 912)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 913)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 914)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 915)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 916)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 106)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 893)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 894)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 895)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 896)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 897)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 898)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 899)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 101)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 902)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 903)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 904)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 905)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 906)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 907)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 908)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 104)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 911)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 912)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 913)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 914)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 915)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 916)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 917)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 107)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 972)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 973)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 974)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 975)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 976)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 977)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 978)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 108)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 981)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 982)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 983)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 984)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 985)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 986)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 987)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 111)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 990)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 991)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 992)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 993)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 994)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 995)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 996)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 114)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 973)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 974)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 975)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 976)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 977)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 978)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 979)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 109)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 982)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 983)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 984)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 985)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 986)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 987)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 988)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 112)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 991)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 992)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 993)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 994)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 995)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 996)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 997)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 115)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 974)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 975)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 976)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 977)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 978)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 979)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 980)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 110)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 983)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 984)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 985)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 986)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 987)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 988)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 989)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 113)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 992)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 993)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 994)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 995)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 996)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 997)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 998)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 116)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1053)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1054)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1055)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1056)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1057)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1058)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1059)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 117)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1062)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1063)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1064)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1065)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1066)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1067)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1068)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 120)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1071)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1072)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1073)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1074)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1075)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1076)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1077)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 123)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1054)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1055)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1056)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1057)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1058)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1059)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1060)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 118)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1063)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1064)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1065)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1066)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1067)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1068)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1069)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 121)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1072)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1073)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1074)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1075)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1076)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1077)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1078)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 124)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1055)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1056)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1057)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1058)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1059)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1060)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1061)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 119)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1064)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1065)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1066)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1067)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1068)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1069)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1070)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 122)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1073)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1074)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1075)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1076)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1077)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1078)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1079)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 125)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1134)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1135)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1136)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1137)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1138)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1139)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1140)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 126)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1143)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1144)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1145)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1146)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1147)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1148)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1149)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 129)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1152)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1153)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1154)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1155)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1156)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1157)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1158)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 132)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1135)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1136)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1137)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1138)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1139)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1140)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1141)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 127)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1144)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1145)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1146)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1147)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1148)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1149)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1150)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 130)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1153)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1154)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1155)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1156)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1157)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1158)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1159)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 133)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1136)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1137)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1138)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1139)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1140)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1141)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1142)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 128)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1145)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1146)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1147)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1148)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1149)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1150)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1151)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 131)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1154)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1155)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1156)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1157)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1158)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1159)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1160)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 134)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1215)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1216)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1217)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1218)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1219)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1220)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1221)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 135)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1224)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1225)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1226)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1227)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1228)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1229)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1230)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 138)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1233)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1234)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1235)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1236)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1237)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1238)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1239)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 141)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1216)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1217)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1218)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1219)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1220)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1221)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1222)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 136)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1225)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1226)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1227)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1228)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1229)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1230)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1231)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 139)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1234)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1235)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1236)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1237)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1238)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1239)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1240)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 142)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1217)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1218)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1219)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1220)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1221)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1222)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1223)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 137)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1226)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1227)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1228)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1229)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1230)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1231)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1232)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 140)]))
-          compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1235)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
-          compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1236)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
-          compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1237)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
-          compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1238)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
-          compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1239)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
-          compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1240)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
-          compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[((floormod(threadIdx.x, 7)*9) + 1241)]*(float32*)kernel.shared[((floordiv(threadIdx.x, 7)*144) + 143)]))
         }
-        for (i3.inner: int32, 0, 7) {
-          compute_2[(((blockIdx.x*392) + (threadIdx.x*7)) + i3.inner)] = max(((float32*)compute_3[i3.inner] + (float32*)bias_2[((blockIdx.x*8) + floordiv(threadIdx.x, 7))]), 0f32)
+        for (i1.inner: int32, 0, 2) {
+          compute_2[(((((floordiv(blockIdx.x, 7)*784) + (floordiv(threadIdx.x, 7)*98)) + (i1.inner*49)) + (floormod(threadIdx.x, 7)*7)) + floormod(blockIdx.x, 7))] = max(((float32*)compute_3[i1.inner] + (float32*)bias_2[(((floordiv(blockIdx.x, 7)*16) + (floordiv(threadIdx.x, 7)*2)) + i1.inner)]), 0f32)
         }
       }
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "IntImm"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "bool", 
-            "value": "1"
-          }
-        }
-      ], 
-      "b64ndarrays": [], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
@@ -1410,7 +345,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.201 ms
+    Execution time of this operator: 0.364 ms
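
For context, the timing above comes from building the tuned operator and measuring it on the GPU. A minimal sketch of that step, assuming the ``sch``, ``args``, and ``target`` objects from the tuning run and illustrative NumPy inputs (``a_np``, ``w_np``, ``b_np``, ``out_np`` are assumed names, not from this log):

.. code-block:: python

    import numpy as np
    import tvm

    # Build the scheduled conv2d and upload the inputs (tvm.gpu() was the
    # device/context API in this 0.8.dev0-era TVM).
    func = tvm.build(sch, args, target)
    ctx = tvm.gpu()
    a_tvm = tvm.nd.array(a_np, ctx=ctx)
    w_tvm = tvm.nd.array(w_np, ctx=ctx)
    b_tvm = tvm.nd.array(b_np, ctx=ctx)
    out_tvm = tvm.nd.empty(out_np.shape, ctx=ctx)

    # The reported number is the median of repeated timed runs.
    evaluator = func.time_evaluator(func.entry_name, ctx, min_repeat_ms=500)
    print("Execution time of this operator: %.3f ms"
          % (np.median(evaluator(a_tvm, w_tvm, b_tvm, out_tvm).results) * 1000))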
 
 
 
@@ -1461,34 +396,34 @@ print the equivalent python schedule API, and build the binary again.
     compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1)
     compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1)
     compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1)
-    compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=1)
+    compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=2)
     compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=8)
     compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=1)
     compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1)
     compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1)
     compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=7)
     compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1)
-    compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=7)
+    compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=1)
     compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=1)
     compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1)
     compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=1)
-    compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=1)
-    compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=16)
-    compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=3)
-    compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=1)
+    compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=2)
+    compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=4)
+    compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=1)
+    compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=3)
     compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=1)
-    compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=3)
+    compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=1)
     s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute [...]
     compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
     compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
     compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
-    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
+    compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=2)
     compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=8)
     compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=1)
     compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
     compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
     compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
-    compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
+    compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=1)
     compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
     compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
     s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
@@ -1512,11 +447,11 @@ print the equivalent python schedule API, and build the binary again.
     kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=56)
     s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
-    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
+    pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
     s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
     pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=56)
     s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis("threadIdx.x"))
-    s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 1024)
+    s[compute].pragma(compute_nn_o_o_o_o, "auto_unroll_max_step", 64)
     s[compute].pragma(compute_nn_o_o_o_o, "unroll_explicit", True)
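
The schedule printout above can be regenerated from the tuning log. A hedged sketch, assuming the 0.8.dev0-era ``auto_scheduler`` API and the ``task``, ``log_file``, and ``target`` objects defined earlier in the tutorial:

.. code-block:: python

    import tvm
    from tvm import auto_scheduler

    # Load the best record for this workload and print the equivalent
    # Python schedule API calls (the block shown above).
    inp, res = auto_scheduler.load_best(log_file, task.workload_key)
    print(task.compute_dag.print_python_code_from_state(inp.state))

    # Re-apply the recorded transformation steps and rebuild the binary.
    sch, args = task.compute_dag.apply_steps_from_state(inp.state)
    func = tvm.build(sch, args, target)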
 
 
@@ -1531,7 +466,6 @@ In the example below we resume the status and do 5 more trials.
 .. code-block:: default
 
 
-
     cost_model = auto_scheduler.XGBModel()
     cost_model.update_from_file(log_file)
     search_policy = auto_scheduler.SketchPolicy(
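
The hunk above cuts off mid-call; for reference, a sketch of the full resume flow as it appears in this era of the tutorial (``task`` and ``log_file`` come from earlier in the script):

.. code-block:: python

    from tvm import auto_scheduler

    # Resume the search: rebuild the cost model from the log and preload the
    # already-measured states so tuning continues where it left off.
    cost_model = auto_scheduler.XGBModel()
    cost_model.update_from_file(log_file)
    search_policy = auto_scheduler.SketchPolicy(
        task, cost_model,
        init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)],
    )
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=5,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    sch, args = auto_scheduler.auto_schedule(task, search_policy,
                                             tuning_options=tune_option)
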
@@ -1565,7 +499,7 @@ In the example below we resume the status and do 5 more trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  40.668 seconds)
+   **Total running time of the script:** ( 2 minutes  46.397 seconds)
 
 
 .. _sphx_glr_download_tutorials_auto_scheduler_tune_conv2d_layer_cuda.py:
diff --git a/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt b/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt
index 8abaac4..ef4ea6f 100644
--- a/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt
+++ b/docs/_sources/tutorials/auto_scheduler/tune_matmul_x86.rst.txt
@@ -11,7 +11,7 @@ Auto-scheduling matrix multiplication for CPU
 =============================================
 **Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_
 
-Different from the existing :ref:`autotvm <tutorials-autotvm-sec>` which relies on 
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
 manual templates to define the search space, the auto-scheduler does not require any templates.
 Users only need to write the computation declaration without any schedule commands or templates.
 The auto-scheduler can automatically generate a large search space and
@@ -153,7 +153,7 @@ After some measurement trials, it will return the best schedule it found.
 
  .. code-block:: none
 
-    *T*T*T*T*T*T*T*T*T
+    *T*T*T*T*T*T*T*T*T*T
 
 
 
@@ -177,75 +177,32 @@ parallelization, vectorization, unrolling and operator fusion.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     primfn(A_1: handle, B_1: handle, C_1: handle, out_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
       buffers = {out: Buffer(out_2: Pointer(float32), float32, [128, 128], []),
-                 B: Buffer(B_2: Pointer(float32), float32, [128, 128], []),
                  C: Buffer(C_2: Pointer(float32), float32, [128, 128], []),
+                 B: Buffer(B_2: Pointer(float32), float32, [128, 128], []),
                  A: Buffer(A_2: Pointer(float32), float32, [128, 128], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C, out_1: out} {
       attr [matmul: Pointer(float32)] "storage_scope" = "global";
       allocate(matmul, float32, [16384]) {
-        for (i.outer.outer.inner: int32, 0, 2) {
-          for (j.outer.outer.inner: int32, 0, 2) {
-            for (j.outer.inner.init: int32, 0, 2) {
-              for (i.inner.init: int32, 0, 64) {
-                for (j.inner.init: int32, 0, 32) {
-                  matmul[(((((i.outer.outer.inner*8192) + (i.inner.init*128)) + (j.outer.outer.inner*64)) + (j.outer.inner.init*32)) + j.inner.init)] = 0f32
-                }
-              }
-            }
-            for (k.outer: int32, 0, 4) {
-              for (j.outer.inner: int32, 0, 2) {
-                for (k.inner: int32, 0, 32) {
-                  for (i.inner: int32, 0, 64) {
-                    for (j.inner: int32, 0, 32) {
-                      matmul[(((((i.outer.outer.inner*8192) + (i.inner*128)) + (j.outer.outer.inner*64)) + (j.outer.inner*32)) + j.inner)] = ((float32*)matmul[(((((i.outer.outer.inner*8192) + (i.inner*128)) + (j.outer.outer.inner*64)) + (j.outer.inner*32)) + j.inner)] + ((float32*)A_2[((((i.outer.outer.inner*8192) + (i.inner*128)) + (k.outer*32)) + k.inner)]*(float32*)B_2[(((((k.outer*4096) + (k.inner*128)) + (j.outer.outer.inner*64)) + (j.outer.inner*32)) + j.inner)]))
-                    }
-                  }
-                }
-              }
+        for (i: int32, 0, 128) {
+          for (j: int32, 0, 128) {
+            matmul[((i*128) + j)] = 0f32
+            for (k: int32, 0, 128) {
+              matmul[((i*128) + j)] = ((float32*)matmul[((i*128) + j)] + ((float32*)A_2[((i*128) + k)]*(float32*)B_2[((k*128) + j)]))
             }
           }
         }
-        for (i.inner_1: int32, 0, 128) {
-          for (j.inner_1: int32, 0, 128) {
-            out_2[((i.inner_1*128) + j.inner_1)] = ((float32*)matmul[((i.inner_1*128) + j.inner_1)] + (float32*)C_2[((i.inner_1*128) + j.inner_1)])
+        for (i_1: int32, 0, 128) {
+          for (j_1: int32, 0, 128) {
+            out_2[((i_1*128) + j_1)] = ((float32*)matmul[((i_1*128) + j_1)] + (float32*)C_2[((i_1*128) + j_1)])
           }
         }
       }
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "IntImm"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "bool", 
-            "value": "1"
-          }
-        }
-      ], 
-      "b64ndarrays": [], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
@@ -291,7 +248,7 @@ We build the binary and check its correctness and performance.
 
  .. code-block:: none
 
-    Execution time of this operator: 0.229 ms
+    Execution time of this operator: 2.209 ms
 
 
 
@@ -407,7 +364,7 @@ In the example below we resume the status and do more 5 trials.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  44.820 seconds)
+   **Total running time of the script:** ( 1 minutes  54.000 seconds)
 
 
 .. _sphx_glr_download_tutorials_auto_scheduler_tune_matmul_x86.py:
diff --git a/docs/_sources/tutorials/auto_scheduler/tune_network_cuda.rst.txt b/docs/_sources/tutorials/auto_scheduler/tune_network_cuda.rst.txt
new file mode 100644
index 0000000..e993bac
--- /dev/null
+++ b/docs/_sources/tutorials/auto_scheduler/tune_network_cuda.rst.txt
@@ -0,0 +1,381 @@
+.. note::
+    :class: sphx-glr-download-link-note
+
+    Click :ref:`here <sphx_glr_download_tutorials_auto_scheduler_tune_network_cuda.py>` to download the full example code
+.. rst-class:: sphx-glr-example-title
+
+.. _sphx_glr_tutorials_auto_scheduler_tune_network_cuda.py:
+
+
+Auto-tuning a Neural Network for NVIDIA GPU
+===========================================
+**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_
+
+Auto-tuning for specific devices and workloads is critical for getting the
+best performance. This is a tutorial on how to tune a whole neural
+network for NVIDIA GPU with the auto-scheduler.
+
+To auto-tune a neural network, we partition the network into small subgraphs and
+tune them independently. Each subgraph is treated as one search task.
+A task scheduler slices the tuning time and dynamically allocates time resources to
+these tasks. The task scheduler predicts the impact of each task on the end-to-end
+execution time and prioritizes the one that can reduce the execution time the most.
+
+For each subgraph, we use the compute declaration in :code:`tvm/python/topi` to
+get the computational DAG in the tensor expression form.
+We then use the auto-scheduler to construct a search space of this DAG and search
+for good schedules (low-level optimizations).
+
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
+manual templates to define the search space, the auto-scheduler does not require any
+schedule templates. In other words, the auto-scheduler only uses the compute declarations
+in :code:`tvm/python/topi` and does not use existing schedule templates.
+
+Note that this tutorial will not run on Windows or recent versions of macOS. To
+get it to run, you will need to wrap the body of this tutorial in an :code:`if
+__name__ == "__main__":` block.
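+
+A minimal sketch of that wrapping (the :code:`main` function name is
+hypothetical):
+
+.. code-block:: python
+
+    def main():
+        # the body of this tutorial goes here
+        ...
+
+    if __name__ == "__main__":
+        main()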
+
+
+.. code-block:: default
+
+
+    import numpy as np
+
+    import tvm
+    from tvm import relay, auto_scheduler
+    import tvm.relay.testing
+    from tvm.contrib import graph_runtime
+
+
+
+
+
+
+
+Define a Network
+----------------
+First, we need to define the network with the relay frontend API.
+We can load some pre-defined networks from :code:`tvm.relay.testing`.
+We can also load models from MXNet, ONNX, PyTorch, and TensorFlow
+(see :ref:`front end tutorials<tutorial-frontend>`).
+
+Note that although the auto-scheduler can work with any layout,
+we found that the best performance is typically achieved with the NHWC layout
+for convolutional neural networks, so we use the NHWC layout in this tutorial.
+
+
+
+.. code-block:: default
+
+
+
+    def get_network(name, batch_size, layout="NHWC", dtype="float32"):
+        """Get the symbol definition and random weight of a network"""
+
+        # auto-scheduler prefers NHWC layout
+        if layout == "NHWC":
+            image_shape = (224, 224, 3)
+        elif layout == "NCHW":
+            image_shape = (3, 224, 224)
+        else:
+            raise ValueError("Invalid layout: " + layout)
+
+        input_shape = (batch_size,) + image_shape
+        output_shape = (batch_size, 1000)
+
+        if name.startswith("resnet-"):
+            n_layer = int(name.split("-")[1])
+            mod, params = relay.testing.resnet.get_workload(
+                num_layers=n_layer,
+                batch_size=batch_size,
+                layout=layout,
+                dtype=dtype,
+                image_shape=image_shape,
+            )
+        elif name.startswith("resnet3d-"):
+            n_layer = int(name.split("-")[1])
+            mod, params = relay.testing.resnet.get_workload(
+                num_layers=n_layer,
+                batch_size=batch_size,
+                layout=layout,
+                dtype=dtype,
+                image_shape=image_shape,
+            )
+        elif name == "mobilenet":
+            mod, params = relay.testing.mobilenet.get_workload(
+                batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
+            )
+        elif name == "squeezenet_v1.1":
+            mod, params = relay.testing.squeezenet.get_workload(
+                version="1.1",
+                batch_size=batch_size,
+                layout=layout,
+                dtype=dtype,
+                image_shape=image_shape,
+            )
+        elif name == "inception_v3":
+            input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
+            mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
+        elif name == "mxnet":
+            # an example for mxnet model
+            from mxnet.gluon.model_zoo.vision import get_model
+
+            assert layout == "NCHW"
+
+            block = get_model("resnet18_v1", pretrained=True)
+            mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
+            net = mod["main"]
+            net = relay.Function(
+                net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
+            )
+            mod = tvm.IRModule.from_expr(net)
+
+        return mod, params, input_shape, output_shape
+
+
+    # Define the neural network and compilation target
+    network = "resnet-18"
+    batch_size = 1
+    layout = "NHWC"
+    target = tvm.target.Target("cuda")
+    dtype = "float32"
+    log_file = "%s-%s-B%d.json" % (network, layout, batch_size)
+
+
+
+
+
+
+
+Extract Search Tasks
+--------------------
+Next, we extract the search tasks and their weights from a network.
+The weight of a task is the number of appearances of the task's subgraph
+in the whole network.
+By using the weight, we can approximate the end-to-end latency of the network
+as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
+latency of a task and :code:`weight[t]` is the weight of the task.
+The task scheduler simply optimizes this objective (a small numeric sketch
+follows the extraction output below).
+
+
+.. code-block:: default
+
+
+    # Enable auto-scheduler in relay
+    auto_scheduler.enable_relay_integration()
+
+    # Extract tasks from the network
+    print("Extract tasks...")
+    mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
+    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
+
+
+
+
+
+.. rst-class:: sphx-glr-script-out
+
+ Out:
+
+ .. code-block:: none
+
+    Extract tasks...
+
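+To make the weighted objective above concrete, here is a minimal numeric
+sketch with hypothetical per-task latencies and weights; it only illustrates
+the formula and is not part of the tuning flow.
+
+.. code-block:: python
+
+    # Hypothetical measured latencies (in ms) and subgraph weights
+    # for three tasks.
+    latencies = [0.14, 0.09, 0.03]
+    weights = [8, 4, 1]
+
+    # The task scheduler minimizes this weighted sum, which approximates
+    # the end-to-end latency of the whole network.
+    estimated_total = sum(l * w for l, w in zip(latencies, weights))
+    print("Estimated total latency: %.2f ms" % estimated_total)  # 1.51 ms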
+
+
+Begin Tuning
+------------
+Now, we set some options for tuning and launch the search tasks:
+
+* :code:`measure_ctx` launches a different process for measurement to
+  provide isolation. It can protect the master process from GPU crashes
+  during measurement and avoid other runtime conflicts.
+* :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
+  This can warm up the GPU, which is necessary to get accurate measurement results.
+  Typically, we recommend a value > 300 ms.
+* :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
+  You can set it to a small number (e.g., 200) for a fast demonstrative run.
+  In practice, we recommend setting it around :code:`1000 * len(tasks)`,
+  which is typically enough for the search to converge.
+  For example, there are 21 tasks in resnet-18, so we can set it to 20000.
+  You can adjust this parameter according to your time budget.
+* In addition, we use :code:`RecordToFile` to dump measurement records into the log file.
+  The measurement records can be used to query the history best, resume the search,
+  and do more analyses later.
+* See :any:`auto_scheduler.TuningOptions` and
+  :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
+
+
+
+.. code-block:: default
+
+
+
+    def run_tuning():
+        print("Begin tuning...")
+        measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=400, timeout=10)
+
+        tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+        tune_option = auto_scheduler.TuningOptions(
+            num_measure_trials=200,  # change this to 20000 to achieve the best performance
+            runner=measure_ctx.runner,
+            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+        )
+
+        tuner.tune(tune_option)
+
+
+    # We do not run the tuning on our web server since it takes too long.
+    # Uncomment the following line to run it by yourself.
+
+    # run_tuning()
+
+
+
+
+
+
+
+
+.. note:: Explain the printed information during tuning
+
+  During the tuning, a lot of information will be printed on the console.
+  It is used for debugging purposes. The most important part is the output
+  of the task scheduler. The following table is a sample output.
+
+  .. code-block:: c
+
+    ----------------------------------------------------------------------
+    ------------------------------  [ Task Scheduler ]
+    ----------------------------------------------------------------------
+    |  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
+    -------------------------------------------------
+    |    0 |        0.014 |          72.07 |     64 |
+    |    1 |        0.185 |        1250.68 |    128 |
+    |    2 |        0.142 |        1626.36 |    192 |
+    |    3 |        0.137 |        1689.42 |    128 |
+    |    4 |        0.097 |        1189.75 |    128 |
+    |    5 |        0.092 |        2505.25 |    128 |
+    |    6 |        0.080 |        2893.08 |    128 |
+    |    7 |        0.119 |        1947.84 |    128 |
+    |    8 |        0.090 |        1292.62 |     64 |
+    |    9 |        0.107 |        2172.30 |     64 |
+    |   10 |        0.095 |        2439.36 |     64 |
+    |   11 |        0.077 |        3003.22 |     64 |
+    |   12 |        0.068 |        1695.13 |     64 |
+    |   13 |        0.058 |        3979.29 |     64 |
+    |   14 |        0.048 |        4859.95 |    128 |
+    |   15 |        0.073 |        3151.76 |     64 |
+    |   16 |        0.056 |        4265.94 |     64 |
+    |   17 |        0.009 |        2754.90 |     64 |
+    |   18 |        0.011 |        1156.08 |     64 |
+    |   19 |        0.013 |         955.80 |     64 |
+    |   20 |        0.029 |         437.71 |     64 |
+    -------------------------------------------------
+    Estimated total latency: 1.649 ms  Trials: 1920  Used time : 3598 s  Next ID: 9
+
+  This table lists the latency and (estimated) speed of all tasks,
+  as well as the allocation of measurement trials across them.
+  The last line prints the total weighted latency of these tasks,
+  which can serve as a rough estimation of the end-to-end execution time
+  of the network, along with the total number of measurement trials,
+  the total time spent on auto-tuning, and the id of the next task to tune.
+
+  There will also be some "dmlc::Error"s and CUDA errors, because the
+  auto-scheduler will try some invalid schedules.
+  You can safely ignore them if the tuning can continue, because these
+  errors are isolated from the master process.
+
+
+.. note:: Terminate the tuning early
+
+  You can terminate the tuning early by forcibly killing this process.
+  As long as you get at least one valid schedule for each task in the log file,
+  you should be able to do the compilation (the section below).
+
+
+Compile and Evaluate
+--------------------
+After auto-tuning, we can compile the network with the best schedules we found.
+All measurement records are dumped into the log file during auto-tuning,
+so we can read the log file and load the best schedules.
+
+
+.. code-block:: default
+
+
+    # Compile with the history best
+    print("Compile...")
+    with auto_scheduler.ApplyHistoryBest(log_file):
+        with tvm.transform.PassContext(opt_level=3):
+            lib = relay.build(mod, target=target, params=params)
+
+    # Create graph runtime
+    ctx = tvm.context(str(target), 0)
+    module = graph_runtime.GraphModule(lib["default"](ctx))
+    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
+    module.set_input("data", data_tvm)
+
+    # Evaluate
+    print("Evaluate inference time cost...")
+    ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
+    prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
+    print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))
+
+
+
+
+
+
+.. rst-class:: sphx-glr-script-out
+
+ Out:
+
+ .. code-block:: none
+
+    Compile...
+    Evaluate inference time cost...
+    Mean inference time (std dev): 3.15 ms (0.01 ms)
+
+
+
+Other Tips
+--------------------
+1. During the tuning, the auto-scheduler needs to compile many programs and
+   extract features from them. This part is CPU-intensive,
+   so a high-performance CPU with many cores is recommended for faster search.
+2. If you have multiple GPUs, you can use all of them to parallelize the
+   measurements. Check this :ref:`section <tutorials-autotvm-rpc-tracker>`
+   to learn how to use the RPC Tracker and RPC Server.
+   To use the RPC Tracker in auto-scheduler, replace the runner in :code:`TuningOptions`
+   with :any:`auto_scheduler.RPCRunner`, as sketched below.
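+
+A minimal sketch of that substitution, assuming a tracker reachable at
+0.0.0.0:9190 with devices registered under the key "v100" (all three values
+are hypothetical):
+
+.. code-block:: python
+
+    # Measurements now run on remote devices registered with the RPC Tracker,
+    # instead of in local processes.
+    tune_option = auto_scheduler.TuningOptions(
+        num_measure_trials=200,
+        runner=auto_scheduler.RPCRunner(
+            "v100",          # device key registered with the tracker (hypothetical)
+            host="0.0.0.0",  # tracker host (hypothetical)
+            port=9190,       # tracker port (hypothetical)
+            repeat=1,
+            min_repeat_ms=400,
+            timeout=10,
+        ),
+        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+    )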
+
+
+
+.. _sphx_glr_download_tutorials_auto_scheduler_tune_network_cuda.py:
+
+
+.. only :: html
+
+ .. container:: sphx-glr-footer
+    :class: sphx-glr-footer-example
+
+
+
+  .. container:: sphx-glr-download
+
+     :download:`Download Python source code: tune_network_cuda.py <tune_network_cuda.py>`
+
+
+
+  .. container:: sphx-glr-download
+
+     :download:`Download Jupyter notebook: tune_network_cuda.ipynb <tune_network_cuda.ipynb>`
+
+
+.. only:: html
+
+ .. rst-class:: sphx-glr-signature
+
+    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
diff --git a/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt b/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt
index 0b4d0aa..1b78ed1 100644
--- a/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/autotvm/sg_execution_times.rst.txt
@@ -5,11 +5,11 @@
 
 Computation times
 =================
-**00:58.942** total execution time for **tutorials_autotvm** files:
-
-- **00:35.492**: :ref:`sphx_glr_tutorials_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
-- **00:22.766**: :ref:`sphx_glr_tutorials_autotvm_tune_simple_template.py` (``tune_simple_template.py``)
-- **00:00.202**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
-- **00:00.169**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
-- **00:00.159**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
-- **00:00.155**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
+**00:59.076** total execution time for **tutorials_autotvm** files:
+
+- **00:30.069**: :ref:`sphx_glr_tutorials_autotvm_tune_conv2d_cuda.py` (``tune_conv2d_cuda.py``)
+- **00:28.300**: :ref:`sphx_glr_tutorials_autotvm_tune_simple_template.py` (``tune_simple_template.py``)
+- **00:00.205**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_cuda.py` (``tune_relay_cuda.py``)
+- **00:00.176**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_x86.py` (``tune_relay_x86.py``)
+- **00:00.163**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_mobile_gpu.py` (``tune_relay_mobile_gpu.py``)
+- **00:00.162**: :ref:`sphx_glr_tutorials_autotvm_tune_relay_arm.py` (``tune_relay_arm.py``)
diff --git a/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt b/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt
index ea8c268..617f854 100644
--- a/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_conv2d_cuda.rst.txt
@@ -47,8 +47,7 @@ Now return to python code. Import packages.
     import numpy as np
 
     import tvm
-    from tvm import te
-    from tvm import topi
+    from tvm import te, topi, testing
     from tvm.topi.testing import conv2d_nchw_python
 
     from tvm import autotvm
@@ -242,26 +241,26 @@ for this template
        7 unroll_explicit: OtherOption([0, 1]) len=2
     )
     Get devices for measurement successfully!
-    No: 1   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 2   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 3   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 4   GFLOPS: 6.67/6.67       result: MeasureResult(costs=(0.03468363275,), error_no=0, all_cost=1.767378807067871, timestamp=1604535302.619864)      [('tile_f', [-1, 64, 4, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,201325
-    No: 5   GFLOPS: 0.00/6.67       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 6   GFLOPS: 0.00/6.67       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 7   GFLOPS: 0.00/6.67       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 8   GFLOPS: 0.00/6.67       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 9   GFLOPS: 11.42/11.42     result: MeasureResult(costs=(0.0202756055,), error_no=0, all_cost=5.4156882762908936, timestamp=1604535309.9670756)     [('tile_f', [-1, 16, 4, 1]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 1, 16]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,10187783
-    No: 10  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 11  GFLOPS: 5.54/11.42      result: MeasureResult(costs=(0.04179441225,), error_no=0, all_cost=1.6754870414733887, timestamp=1604535311.4029276)    [('tile_f', [-1, 2, 2, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 4, 16]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,5741229
-    No: 12  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 13  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 14  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 15  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 16  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 17  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 18  GFLOPS: 0.00/11.42      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
-    No: 19  GFLOPS: 11.83/11.83     result: MeasureResult(costs=(0.0195637445,), error_no=0, all_cost=1.9244208335876465, timestamp=1604535316.3850935)     [('tile_f', [-1, 16, 2, 2]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,678328
-    No: 20  GFLOPS: 0.00/11.83      result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7fbb5b635fe1]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d0697) [0x7fbb5aa8c697]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7fbb5aa8baae]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 1   GFLOPS: 226.07/226.07   result: MeasureResult(costs=(0.0010240150306122448,), error_no=0, all_cost=1.4391686916351318, timestamp=1605262522.444662)     [('tile_f', [-1, 2, 64, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 2, 2]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4881186
+    No: 2   GFLOPS: 0.00/226.07     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 3   GFLOPS: 179.21/226.07   result: MeasureResult(costs=(0.0012917972661290324,), error_no=0, all_cost=1.6224138736724854, timestamp=1605262523.8775072)    [('tile_f', [-1, 4, 32, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 1]), ('tile_rc', [-1, 1, 16]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3605182
+    No: 4   GFLOPS: 0.00/226.07     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 5   GFLOPS: 0.00/226.07     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 6   GFLOPS: 0.00/226.07     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 7   GFLOPS: 0.00/226.07     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 8   GFLOPS: 1.75/226.07     result: MeasureResult(costs=(0.13202702,), error_no=0, all_cost=3.336221933364868, timestamp=1605262527.2192101)        [('tile_f', [-1, 2, 4, 64]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 2, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],None,2716108
+    No: 9   GFLOPS: 12.08/226.07    result: MeasureResult(costs=(0.019164146333333333,), error_no=0, all_cost=1.751448392868042, timestamp=1605262530.169132)       [('tile_f', [-1, 1, 4, 2]), ('tile_y', [-1, 7, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 2, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,1263092
+    No: 10  GFLOPS: 228.40/228.40   result: MeasureResult(costs=(0.0010135667474747475,), error_no=0, all_cost=1.4332818984985352, timestamp=1605262531.0474083)    [('tile_f', [-1, 1, 32, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,8921130
+    No: 11  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 12  GFLOPS: 120.00/228.40   result: MeasureResult(costs=(0.0019292541346153846,), error_no=0, all_cost=1.344985008239746, timestamp=1605262532.1955059)     [('tile_f', [-1, 2, 32, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 1, 1]), ('tile_ry', [-1, 1, 3]), ('tile_rx', [-1, 1, 3]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,5036371
+    No: 13  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 14  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 15  GFLOPS: 82.26/228.40    result: MeasureResult(costs=(0.0028143660526315792,), error_no=0, all_cost=1.4765589237213135, timestamp=1605262533.614049)     [('tile_f', [-1, 1, 1, 4]), ('tile_y', [-1, 1, 1, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3582580
+    No: 16  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 17  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 18  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
+    No: 19  GFLOPS: 18.26/228.40    result: MeasureResult(costs=(0.012675726555555555,), error_no=0, all_cost=1.667898178100586, timestamp=1605262536.8822658)      [('tile_f', [-1, 8, 64, 1]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 2, 2]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,4107668
+    No: 20  GFLOPS: 0.00/228.40     result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n  [bt] (4) /workspace/build/libtvm.so(TVMFuncCall+0x61) [0x7f902ac24901]\n  [bt] (3) /workspace/build/libtvm.so(+0x6d54a7) [0x7f902a0564a7]\n  [bt] (2) /workspace/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x40e) [0x7f902a0558be]\n  [bt] (1) /workspace/build/libtvm.so(tvm::tir::transform::PrimFunc [...]
 
 
 
@@ -313,8 +312,8 @@ and measure running time.
 
 
     Best config:
-    [('tile_f', [-1, 16, 2, 2]), ('tile_y', [-1, 1, 1, 7]), ('tile_x', [-1, 1, 7, 1]), ('tile_rc', [-1, 1, 8]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 3, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],None,678328
-    Time cost of this operator: 0.015317
+    [('tile_f', [-1, 1, 32, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 7, 1, 1]), ('tile_rc', [-1, 16, 1]), ('tile_ry', [-1, 3, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 1)],None,8921130
+    Time cost of this operator: 0.001455
 
 
 
diff --git a/docs/_sources/tutorials/autotvm/tune_relay_arm.rst.txt b/docs/_sources/tutorials/autotvm/tune_relay_arm.rst.txt
index 7a5b805..40ab833 100644
--- a/docs/_sources/tutorials/autotvm/tune_relay_arm.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_relay_arm.rst.txt
@@ -60,9 +60,7 @@ Now return to python code. Import packages.
 
     import numpy as np
     import tvm
-    from tvm import te
-    from tvm import autotvm
-    from tvm import relay
+    from tvm import relay, autotvm
     import tvm.relay.testing
     from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
     from tvm.contrib.utils import tempdir
@@ -107,7 +105,7 @@ We can also load models from MXNet, ONNX and TensorFlow.
                 batch_size=batch_size, version="1.1", dtype=dtype
             )
         elif name == "inception_v3":
-            input_shape = (1, 3, 299, 299)
+            input_shape = (batch_size, 3, 299, 299)
             mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
         elif name == "mxnet":
             # an example for mxnet model
diff --git a/docs/_sources/tutorials/autotvm/tune_relay_cuda.rst.txt b/docs/_sources/tutorials/autotvm/tune_relay_cuda.rst.txt
index d0fd274..e42d10c 100644
--- a/docs/_sources/tutorials/autotvm/tune_relay_cuda.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_relay_cuda.rst.txt
@@ -58,12 +58,9 @@ Now return to python code. Import packages.
     import numpy as np
 
     import tvm
-    from tvm import te
-    from tvm import autotvm
-    from tvm import relay
+    from tvm import relay, autotvm
     import tvm.relay.testing
     from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
-    from tvm.contrib.utils import tempdir
     import tvm.contrib.graph_runtime as runtime
 
 
@@ -105,7 +102,7 @@ We can also load models from MXNet, ONNX and TensorFlow.
                 batch_size=batch_size, version="1.1", dtype=dtype
             )
         elif name == "inception_v3":
-            input_shape = (1, 3, 299, 299)
+            input_shape = (batch_size, 3, 299, 299)
             mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
         elif name == "mxnet":
             # an example for mxnet model
@@ -267,11 +264,6 @@ Finally, we launch tuning jobs and evaluate the end-to-end performance.
             with tvm.transform.PassContext(opt_level=3):
                 lib = relay.build_module.build(mod, target=target, params=params)
 
-            # export library
-            tmp = tempdir()
-            filename = "net.tar"
-            lib.export_library(tmp.relpath(filename))
-
             # load parameters
             ctx = tvm.context(str(target), 0)
             module = runtime.GraphModule(lib["default"](ctx))
@@ -352,6 +344,7 @@ As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30m
 
 Scale up measurement by using multiple devices
 ----------------------------------------------
+.. _tutorials-autotvm-rpc-tracker:
 
 If you have multiple devices, you can use all of them for measurement.
 TVM uses the RPC Tracker to manage distributed devices.
diff --git a/docs/_sources/tutorials/autotvm/tune_relay_mobile_gpu.rst.txt b/docs/_sources/tutorials/autotvm/tune_relay_mobile_gpu.rst.txt
index ea7dd8f..03e9949 100644
--- a/docs/_sources/tutorials/autotvm/tune_relay_mobile_gpu.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_relay_mobile_gpu.rst.txt
@@ -59,9 +59,7 @@ Now return to python code. Import packages.
     import numpy as np
 
     import tvm
-    from tvm import te
-    from tvm import autotvm
-    from tvm import relay
+    from tvm import relay, autotvm
     import tvm.relay.testing
     from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
     from tvm.contrib.utils import tempdir
@@ -106,7 +104,7 @@ We can also load models from MXNet, ONNX and TensorFlow.
                 batch_size=batch_size, version="1.1", dtype=dtype
             )
         elif name == "inception_v3":
-            input_shape = (1, 3, 299, 299)
+            input_shape = (batch_size, 3, 299, 299)
             mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
         elif name == "mxnet":
             # an example for mxnet model
diff --git a/docs/_sources/tutorials/autotvm/tune_relay_x86.rst.txt b/docs/_sources/tutorials/autotvm/tune_relay_x86.rst.txt
index c6d5933..6d131ba 100644
--- a/docs/_sources/tutorials/autotvm/tune_relay_x86.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_relay_x86.rst.txt
@@ -27,9 +27,7 @@ __name__ == "__main__":` block.
     import numpy as np
 
     import tvm
-    from tvm import te
-    from tvm import autotvm
-    from tvm import relay
+    from tvm import relay, autotvm
     from tvm.relay import testing
     from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
     from tvm.autotvm.graph_tuner import DPTuner, PBQPTuner
@@ -77,7 +75,7 @@ In this tutorial, we choose resnet-18 as tuning example.
                 batch_size=batch_size, version="1.1", dtype=dtype
             )
         elif name == "inception_v3":
-            input_shape = (1, 3, 299, 299)
+            input_shape = (batch_size, 3, 299, 299)
             mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
         elif name == "mxnet":
             # an example for mxnet model
diff --git a/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt b/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt
index 0b00525..f736406 100644
--- a/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt
+++ b/docs/_sources/tutorials/autotvm/tune_simple_template.rst.txt
@@ -53,7 +53,7 @@ Now return to python code. Import packages.
 
     import numpy as np
     import tvm
-    from tvm import te
+    from tvm import te, testing
 
     # the module is called `autotvm`
     from tvm import autotvm
@@ -369,16 +369,16 @@ used to get the best config later.
  .. code-block:: none
 
     Get devices for measurement successfully!
-    No: 1   GFLOPS: 12.60/12.60     result: MeasureResult(costs=(0.021309730199999998,), error_no=0, all_cost=0.8643097877502441, timestamp=1604535277.780555)      [('tile_y', [-1, 1]), ('tile_x', [-1, 64])],None,60
-    No: 2   GFLOPS: 10.92/12.60     result: MeasureResult(costs=(0.0245915336,), error_no=0, all_cost=0.796893835067749, timestamp=1604535278.6252985)      [('tile_y', [-1, 512]), ('tile_x', [-1, 512])],None,99
-    No: 3   GFLOPS: 14.11/14.11     result: MeasureResult(costs=(0.019021120199999998,), error_no=0, all_cost=0.8052010536193848, timestamp=1604535279.389854)      [('tile_y', [-1, 8]), ('tile_x', [-1, 512])],None,93
-    No: 4   GFLOPS: 2.56/14.11      result: MeasureResult(costs=(0.104887642,), error_no=0, all_cost=2.149117946624756, timestamp=1604535281.517971)        [('tile_y', [-1, 2]), ('tile_x', [-1, 16])],None,41
-    No: 5   GFLOPS: 10.09/14.11     result: MeasureResult(costs=(0.0266062698,), error_no=0, all_cost=1.2246041297912598, timestamp=1604535282.4069664)     [('tile_y', [-1, 2]), ('tile_x', [-1, 128])],None,71
-    No: 6   GFLOPS: 9.87/14.11      result: MeasureResult(costs=(0.027194059399999998,), error_no=0, all_cost=1.3297405242919922, timestamp=1604535283.2947125)     [('tile_y', [-1, 16]), ('tile_x', [-1, 128])],None,74
-    No: 7   GFLOPS: 0.44/14.11      result: MeasureResult(costs=(0.6160579704,), error_no=0, all_cost=10.218012571334839, timestamp=1604535293.5908499)     [('tile_y', [-1, 512]), ('tile_x', [-1, 1])],None,9
-    No: 8   GFLOPS: 10.97/14.11     result: MeasureResult(costs=(0.0244691906,), error_no=0, all_cost=0.8608279228210449, timestamp=1604535294.43033)       [('tile_y', [-1, 8]), ('tile_x', [-1, 64])],None,63
-    No: 9   GFLOPS: 12.51/14.11     result: MeasureResult(costs=(0.0214495178,), error_no=0, all_cost=0.6790502071380615, timestamp=1604535296.0291827)     [('tile_y', [-1, 256]), ('tile_x', [-1, 512])],None,98
-    No: 10  GFLOPS: 14.74/14.74     result: MeasureResult(costs=(0.0182092076,), error_no=0, all_cost=0.7521607875823975, timestamp=1604535296.7846766)     [('tile_y', [-1, 64]), ('tile_x', [-1, 128])],None,76
+    No: 1   GFLOPS: 0.52/0.52       result: MeasureResult(costs=(0.519133092,), error_no=0, all_cost=8.710088014602661, timestamp=1605262499.8931446)       [('tile_y', [-1, 64]), ('tile_x', [-1, 1])],None,6
+    No: 2   GFLOPS: 2.19/2.19       result: MeasureResult(costs=(0.122798191,), error_no=0, all_cost=2.4234249591827393, timestamp=1605262502.3358595)      [('tile_y', [-1, 512]), ('tile_x', [-1, 8])],None,39
+    No: 3   GFLOPS: 2.68/2.68       result: MeasureResult(costs=(0.1002148718,), error_no=0, all_cost=2.024742603302002, timestamp=1605262504.4139025)      [('tile_y', [-1, 2]), ('tile_x', [-1, 8])],None,31
+    No: 4   GFLOPS: 7.24/7.24       result: MeasureResult(costs=(0.0370866816,), error_no=0, all_cost=1.0611913204193115, timestamp=1605262505.483117)      [('tile_y', [-1, 1]), ('tile_x', [-1, 32])],None,50
+    No: 5   GFLOPS: 13.37/13.37     result: MeasureResult(costs=(0.020077077,), error_no=0, all_cost=0.7708723545074463, timestamp=1605262506.2793317)      [('tile_y', [-1, 256]), ('tile_x', [-1, 64])],None,68
+    No: 6   GFLOPS: 12.17/13.37     result: MeasureResult(costs=(0.0220493612,), error_no=0, all_cost=0.7993049621582031, timestamp=1605262507.1112614)     [('tile_y', [-1, 256]), ('tile_x', [-1, 512])],None,98
+    No: 7   GFLOPS: 0.92/13.37      result: MeasureResult(costs=(0.29137312579999997,), error_no=0, all_cost=5.066913843154907, timestamp=1605262512.2570298)       [('tile_y', [-1, 128]), ('tile_x', [-1, 2])],None,17
+    No: 8   GFLOPS: 2.61/13.37      result: MeasureResult(costs=(0.102951418,), error_no=0, all_cost=2.0490610599517822, timestamp=1605262514.3929913)      [('tile_y', [-1, 8]), ('tile_x', [-1, 4])],None,23
+    No: 9   GFLOPS: 11.68/13.37     result: MeasureResult(costs=(0.0229774654,), error_no=0, all_cost=0.7303047180175781, timestamp=1605262515.9335515)     [('tile_y', [-1, 256]), ('tile_x', [-1, 32])],None,58
+    No: 10  GFLOPS: 14.79/14.79     result: MeasureResult(costs=(0.018150249,), error_no=0, all_cost=0.760230541229248, timestamp=1605262516.7134416)       [('tile_y', [-1, 64]), ('tile_x', [-1, 128])],None,76
 
 
 
diff --git a/docs/_sources/tutorials/dev/bring_your_own_datatypes.rst.txt b/docs/_sources/tutorials/dev/bring_your_own_datatypes.rst.txt
index c5f9343..f437fd3 100644
--- a/docs/_sources/tutorials/dev/bring_your_own_datatypes.rst.txt
+++ b/docs/_sources/tutorials/dev/bring_your_own_datatypes.rst.txt
@@ -521,7 +521,7 @@ Now, to actually convert the entire network, we have written `a pass in Relay <h
 
  .. code-block:: none
 
-      Check failed: lower == false: Intrinsic lowering function for target llvm, intrinsic name tir.sqrt, type 150 not found
+      Check failed: lower == false: FloatImm lowering function for target llvm type 150 not found
 
 
 
diff --git a/docs/_sources/tutorials/dev/low_level_custom_pass.rst.txt b/docs/_sources/tutorials/dev/low_level_custom_pass.rst.txt
index b51499e..d0dcd93 100644
--- a/docs/_sources/tutorials/dev/low_level_custom_pass.rst.txt
+++ b/docs/_sources/tutorials/dev/low_level_custom_pass.rst.txt
@@ -72,11 +72,10 @@ our customized lowering pass to manipulate the IR directly instead of using sche
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     primfn(a_1: handle, b_1: handle, c_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
-      buffers = {b: Buffer(b_2: Pointer(float32), float32, [128], []),
-                 c: Buffer(c_2: Pointer(float32), float32, [128], []),
+      buffers = {c: Buffer(c_2: Pointer(float32), float32, [128], []),
+                 b: Buffer(b_2: Pointer(float32), float32, [128], []),
                  a: Buffer(a_2: Pointer(float32), float32, [128], [])}
       buffer_map = {a_1: a, b_1: b, c_1: c} {
       for (i: int32, 0, 128) {
@@ -84,35 +83,7 @@ our customized lowering pass to manipulate the IR directly instead of using sche
       }
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "IntImm"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "bool", 
-            "value": "1"
-          }
-        }
-      ], 
-      "b64ndarrays": [], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
@@ -248,7 +219,6 @@ Thus, a good place to put this transformation pass is just after Phase 1.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     primfn(a_1: handle, b_1: handle, c_1: handle) -> ()
       attr = {"global_symbol": "main", "tir.noalias": True}
       buffers = {c: Buffer(c_2: Pointer(float32), float32, [128], []),
@@ -260,35 +230,7 @@ Thus, a good place to put this transformation pass is just after Phase 1.
       }
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "IntImm"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "bool", 
-            "value": "1"
-          }
-        }
-      ], 
-      "b64ndarrays": [], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
diff --git a/docs/_sources/tutorials/dev/sg_execution_times.rst.txt b/docs/_sources/tutorials/dev/sg_execution_times.rst.txt
index 5dfa385..3fd5151 100644
--- a/docs/_sources/tutorials/dev/sg_execution_times.rst.txt
+++ b/docs/_sources/tutorials/dev/sg_execution_times.rst.txt
@@ -5,8 +5,8 @@
 
 Computation times
 =================
-**00:33.069** total execution time for **tutorials_dev** files:
+**00:31.889** total execution time for **tutorials_dev** files:
 
-- **00:31.106**: :ref:`sphx_glr_tutorials_dev_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``)
-- **00:01.760**: :ref:`sphx_glr_tutorials_dev_use_pass_infra.py` (``use_pass_infra.py``)
-- **00:00.203**: :ref:`sphx_glr_tutorials_dev_low_level_custom_pass.py` (``low_level_custom_pass.py``)
+- **00:31.318**: :ref:`sphx_glr_tutorials_dev_bring_your_own_datatypes.py` (``bring_your_own_datatypes.py``)
+- **00:00.391**: :ref:`sphx_glr_tutorials_dev_use_pass_infra.py` (``use_pass_infra.py``)
+- **00:00.180**: :ref:`sphx_glr_tutorials_dev_low_level_custom_pass.py` (``low_level_custom_pass.py``)
diff --git a/docs/_sources/tutorials/dev/use_pass_infra.rst.txt b/docs/_sources/tutorials/dev/use_pass_infra.rst.txt
index f4589eb..88c7fe3 100644
--- a/docs/_sources/tutorials/dev/use_pass_infra.rst.txt
+++ b/docs/_sources/tutorials/dev/use_pass_infra.rst.txt
@@ -142,7 +142,6 @@ Manually Apply Optimization Passes
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -151,127 +150,7 @@ Manually Apply Optimization Passes
       add(%2, %3) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
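The "Manually Apply Optimization Passes" output above shows a module after constant folding. A minimal sketch of applying a single Relay pass by hand, assuming the API of this build; the toy expression below is illustrative only:

 .. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    c = relay.const(np.ones((4,), dtype="float32"))
    x = relay.var("x", shape=(4,), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.add(relay.add(c, c), x)))

    # A pass object is callable on a module; required passes such as
    # InferType are resolved by the pass infrastructure automatically.
    mod = relay.transform.FoldConstant()(mod)
    print(mod)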
@@ -294,7 +173,6 @@ eliminate the common expressions that are used by `z` and `z1`.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -302,127 +180,7 @@ eliminate the common expressions that are used by `z` and `z1`.
       add(%2, %2) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
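The ``add(%2, %2)`` line in the output above is common subexpression elimination at work. A sketch of invoking that pass directly, assuming the same API; ``z`` and ``z1`` mirror the tutorial's naming:

 .. code-block:: python

    import tvm
    from tvm import relay

    x = relay.var("x", shape=(2, 2), dtype="float32")
    z = relay.add(x, relay.const(1.0))
    z1 = relay.add(x, relay.const(1.0))  # structurally identical to z
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.add(z, z1)))

    # After the pass, z and z1 share a single add node, matching the
    # add(%2, %2) form shown in the hunk above.
    print(relay.transform.EliminateCommonSubexpr()(mod))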
@@ -449,7 +207,6 @@ opt level 0 will not allow operators to be fused together. Users can pass the
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */
@@ -469,127 +226,7 @@ opt level 0 will not allow operators to be fused together. Users can pass the
       %6(%5) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
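The hunk above notes that opt level 0 keeps operators unfused. A sketch of steering that knob directly via ``fuse_opt_level``, assuming the signature in this build; the expression is illustrative:

 .. code-block:: python

    import tvm
    from tvm import relay

    x = relay.var("x", shape=(4,), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.add(relay.exp(x), x)))

    # fuse_opt_level=0 leaves each operator in its own primitive function,
    # matching the unfused output above; level 2 lets exp and add fuse.
    print(relay.transform.FuseOps(fuse_opt_level=0)(mod))
    print(relay.transform.FuseOps(fuse_opt_level=2)(mod))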
@@ -641,7 +278,6 @@ pass.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %4 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -653,127 +289,7 @@ pass.
       %4(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1QKCgpLCgoKawoKCksAAIB/EgoKSwoKKksKCgpLCgoKS5IJCksKCgpTCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgo [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1MKCgpKCgoKagoKCkoAAIB/EgoKSgoKKkoKCgpKCgoKSpIJCkoKCgpSCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgo [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
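The fused single-function output above comes from the tutorial's ``Sequential`` section. A minimal sketch of bundling passes that way, assuming the API of this build; the module construction is illustrative:

 .. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    x = relay.var("x", shape=(4,), dtype="float32")
    c = relay.const(np.full((4,), 2.0, dtype="float32"))
    mod = tvm.IRModule.from_expr(
        relay.Function([x], relay.add(relay.add(x, c), relay.add(x, c)))
    )

    # Sequential runs the passes back to back, resolving required passes
    # such as InferType in between.
    seq = tvm.transform.Sequential(
        [
            relay.transform.FoldConstant(),
            relay.transform.EliminateCommonSubexpr(),
            relay.transform.FuseOps(fuse_opt_level=2),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        print(seq(mod))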
@@ -803,7 +319,6 @@ for users to customize the optimization level that they want to execute.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %3 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -814,127 +329,7 @@ for users to customize the optimization level that they want to execute.
       %3(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1QKCgpLCgoKawoKCksAAIB/EgoKSwoKKksKCgpLCgoKS5IJCksKCgpTCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgo [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1MKCgpKCgoKagoKCkoAAIB/EgoKSgoKKkoKCgpKCgoKSpIJCkoKCgpSCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgo [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
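The hunk above corresponds to customizing the optimization level. A sketch of how the surrounding ``PassContext`` gates which passes run, assuming the registered opt levels in this build (2 for ``FoldConstant``, 3 for ``EliminateCommonSubexpr``); the setup is illustrative:

 .. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    x = relay.var("x", shape=(4,), dtype="float32")
    c = relay.const(np.full((4,), 2.0, dtype="float32"))
    mod = tvm.IRModule.from_expr(
        relay.Function([x], relay.add(relay.add(x, c), relay.add(x, c)))
    )
    seq = tvm.transform.Sequential(
        [relay.transform.FoldConstant(), relay.transform.EliminateCommonSubexpr()]
    )

    with tvm.transform.PassContext(opt_level=2):
        print(seq(mod))  # FoldConstant runs, EliminateCommonSubexpr is skipped
    with tvm.transform.PassContext(opt_level=3):
        print(seq(mod))  # both passes run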
@@ -964,7 +359,6 @@ identical addition operations.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %4 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -976,127 +370,7 @@ identical addition operations.
       %4(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1QKCgpLCgoKawoKCksAAIB/EgoKSwoKKksKCgpLCgoKS5IJCksKCgpTCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgo [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1MKCgpKCgoKagoKCkoAAIB/EgoKSgoKKkoKCgpKCgoKSpIJCkoKCgpSCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgo [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
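The two identical additions in the output above survive because a pass was disabled by name. A sketch of the ``disabled_pass`` configuration, assuming the keyword as present in this build; the module is illustrative:

 .. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    x = relay.var("x", shape=(4,), dtype="float32")
    c = relay.const(np.full((4,), 2.0, dtype="float32"))
    mod = tvm.IRModule.from_expr(
        relay.Function([x], relay.add(relay.add(x, c), relay.add(x, c)))
    )
    seq = tvm.transform.Sequential(
        [relay.transform.FoldConstant(), relay.transform.EliminateCommonSubexpr()]
    )

    # disabled_pass switches a pass off by name even when opt_level would
    # otherwise enable it, so both identical additions are kept.
    with tvm.transform.PassContext(opt_level=3, disabled_pass=["EliminateCommonSubexpr"]):
        print(seq(mod))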
@@ -1128,7 +402,6 @@ alteration pass falls into this category.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %3 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -1139,128 +412,7 @@ alteration pass falls into this category.
       %3(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1QKCgpLCgoKawoKCksAAIB/EgoKSwoKKksKCgpLCgoKS5IJCksKCgpTCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgoKSwoKCksKCgpLCgo [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1MKCgpKCgoKagoKCkoAAIB/EgoKSgoKKkoKCgpKCgoKSpIJCkoKCgpSCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgo [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
-    #[version = "0.0.5"]
+
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = layout_transform(%x, src_layout="NCHW", dst_layout="NCHW16c") /* ty=Tensor[(1, 4, 56, 56, 16), float32] */;
       %1 = nn.conv2d(%0, %weight, padding=[0, 0, 0, 0], data_layout="NCHW16c") /* ty=Tensor[(1, 4, 54, 54, 16), float32] */;
@@ -1275,78 +427,7 @@ alteration pass falls in such category.
       layout_transform(%9, src_layout="NCHW16c", dst_layout="NCHW") /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIJhs1MKCgpKCgoKagoKCkoAAIB/EgoKSgoKKkoKCgpKCgoKSpIJCkoKCgpSCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgoKSgoKCkoKCgpKCgo [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
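The ``NCHW16c`` transforms in the output above come from target-dependent layout alteration, which only fires inside a target scope. A minimal sketch, assuming ``tvm.target.Target("llvm")`` and ``AlterOpLayout`` as available in this build; the conv2d workload is illustrative:

 .. code-block:: python

    import tvm
    from tvm import relay

    x = relay.var("x", shape=(1, 64, 56, 56))
    weight = relay.var("weight", shape=(64, 64, 3, 3))
    mod = tvm.IRModule.from_expr(
        relay.Function([x, weight], relay.nn.conv2d(x, weight, padding=(0, 0)))
    )

    # Layout alteration consults the target; on x86 this is what rewrites
    # NCHW convolutions to NCHW16c as shown above.
    seq = tvm.transform.Sequential([relay.transform.AlterOpLayout()])
    with tvm.target.Target("llvm"):
        with tvm.transform.PassContext(opt_level=3):
            print(seq(mod))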
@@ -1401,7 +482,6 @@ customized pass.
 
  .. code-block:: none
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = multiply(3f /* ty=float32 */, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -1414,78 +494,7 @@ customized pass.
       add(%6, %7) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgH8AAAAAEYGyeQAAAAAKCgpKCgoKSgoSCko6C4pKCgoKSgoKCkoKcjpLCgoKSgoKCkpaG7pLCgoKSgoKCkoKCgpKCgoKSgoKCkoAAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAC [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
 
 
 
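The ``multiply(3f, ...)`` terms in the output above come from the tutorial's customized pass, written with the Python decorator interface. A sketch of that pattern, assuming the ``function_pass`` decorator of this build; ``ReplaceConstant`` and the toy module are illustrative names:

 .. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    @relay.transform.function_pass(opt_level=1)
    class ReplaceConstant:
        """Rewrite every constant c to 3 * c, mirroring the output above."""

        def transform_function(self, func, mod, ctx):
            class Mutator(relay.ExprMutator):
                def visit_constant(self, const):
                    return relay.multiply(relay.const(3.0), const)

            return Mutator().visit(func)

    x = relay.var("x", shape=(4,), dtype="float32")
    c = relay.const(np.ones((4,), dtype="float32"))
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.add(x, c)))
    print(ReplaceConstant()(mod))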
@@ -1552,7 +561,6 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
     Running pass: {} The meta data of the pass: pass name: FoldConstantopt_level: 2required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]);
       %1 = add(meta[relay.Constant][0], meta[relay.Constant][0]);
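The "Running pass: {} ..." lines in the trace output above come from a ``trace`` callback installed on the ``PassContext``; the stray ``{}`` is an artifact of an unformatted ``print`` in the tutorial. A minimal sketch, assuming the ``trace`` keyword still present in this build (later TVM versions replace it with pass instruments); the module is illustrative:

 .. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay

    def print_ir(mod, info, is_before):
        """Print the pass name and the module before each pass runs."""
        if is_before:
            print("Running pass: {}".format(info))
            print(mod)

    c = relay.const(np.ones((4,), dtype="float32"))
    x = relay.var("x", shape=(4,), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.add(relay.add(c, c), x)))

    with tvm.transform.PassContext(opt_level=3, trace=print_ir):
        mod = relay.transform.FoldConstant()(mod)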
@@ -1563,162 +571,26 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%4, %5)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "0", 
-            "data": "0", 
-            "span": "0"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: InferTypeopt_level: 0required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() {
       add(meta[relay.Constant][0], meta[relay.Constant][0])
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "0", 
-            "data": "0", 
-            "span": "0"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: FuseOpsopt_level: 1required passes: [
     InferType, ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       add(meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: InferTypeopt_level: 0required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       %0 = fn (%p0: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         add(%p0, %p0)
@@ -1726,82 +598,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %0(meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: ToANormalFormopt_level: 1required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       %0 = fn (%p0: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         add(%p0, %p0) /* ty=Tensor[(1, 64, 54, 54), float32] */
@@ -1809,82 +609,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %0(meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: InferTypeopt_level: 0required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       let %x = meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */;
       let %x1 = fn (%p0: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
@@ -1894,202 +622,26 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %x2
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: InferTypeopt_level: 0required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() {
       multiply(meta[relay.Constant][0], 2f)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "0", 
-            "data": "0", 
-            "span": "0"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNR+RkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: FuseOpsopt_level: 1required passes: [
     InferType, ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       multiply(meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, 2f /* ty=float32 */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNR+RkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: InferTypeopt_level: 0required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       %0 = fn (%p0: Tensor[(1, 64, 54, 54), float32], %p1: float32, Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         multiply(%p0, %p1)
@@ -2097,82 +649,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %0(meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, 2f /* ty=float32 */)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNR+RkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: {} The meta data of the pass: pass name: ToANormalFormopt_level: 1required passes: [
     ]
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       %0 = fn (%p0: Tensor[(1, 64, 54, 54), float32], %p1: float32, Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         multiply(%p0, %p1) /* ty=Tensor[(1, 64, 54, 54), float32] */
@@ -2180,82 +660,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %0(meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, 2f /* ty=float32 */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNR+RkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main() -> Tensor[(1, 64, 54, 54), float32] {
       let %x = meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */;
       let %x1 = 2f /* ty=float32 */;
@@ -2266,82 +674,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %x3
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNR+RkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]);
       %1 = add(%0, meta[relay.Constant][0]);
@@ -2350,51 +686,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %3)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 4]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "0", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "0", 
-            "data": "1", 
-            "span": "0"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: PrintIR (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -2403,131 +698,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %3) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -2536,131 +710,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %3) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: EliminateCommonSubexpr (opt_level: 3, required passes: [InferType])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -2669,131 +722,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %3) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -2801,131 +733,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %2)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -2933,131 +744,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %2) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: FuseOps (opt_level: 1, required passes: [InferType])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %0 = nn.conv2d(%x, %weight, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
       %1 = add(%0, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -3065,131 +755,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       add(%2, %2) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %3 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]);
@@ -3200,131 +769,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %3(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %3 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -3335,131 +783,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %3(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: AlterOpLayout (opt_level: 3, required passes: [InferType])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %3 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 64, 54, 54), float32] */;
@@ -3470,131 +797,10 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %3(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     Running pass: InferType (opt_level: 0, required passes: [])
 
-    #[version = "0.0.5"]
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %7 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = layout_transform(%p0, src_layout="NCHW", dst_layout="NCHW16c");
@@ -3609,128 +815,7 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %7(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */)
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
-    #[version = "0.0.5"]
+
     def @main(%x: Tensor[(1, 64, 56, 56), float32], %weight: Tensor[(64, 64, 3, 3), float32]) -> Tensor[(1, 64, 54, 54), float32] {
       %7 = fn (%p0: Tensor[(1, 64, 56, 56), float32], %p1: Tensor[(64, 64, 3, 3), float32], %p2: Tensor[(1, 64, 54, 54), float32], %p3: Tensor[(1, 64, 54, 54), float32], Primitive=1) -> Tensor[(1, 64, 54, 54), float32] {
         %0 = layout_transform(%p0, src_layout="NCHW", dst_layout="NCHW16c") /* ty=Tensor[(1, 4, 56, 56, 16), float32] */;
@@ -3745,127 +830,7 @@ could be like the following to enable IR dumping for ``FoldConstant`` optimizati
       %7(%x, %weight, meta[relay.Constant][0] /* ty=Tensor[(1, 64, 54, 54), float32] */, meta[relay.Constant][1] /* ty=Tensor[(1, 64, 54, 54), float32] */) /* ty=Tensor[(1, 64, 54, 54), float32] */
     }
 
-    #[metadata]
-    {
-      "root": 1, 
-      "nodes": [
-        {
-          "type_key": ""
-        }, 
-        {
-          "type_key": "Map", 
-          "keys": [
-            "relay.Constant"
-          ], 
-          "data": [2]
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [3, 10]
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "4", 
-            "data": "0", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "5", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [6, 7, 8, 9]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "relay.Constant", 
-          "attrs": {
-            "_checked_type_": "11", 
-            "data": "1", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "relay.TensorType", 
-          "attrs": {
-            "dtype": "float32", 
-            "shape": "12", 
-            "span": "0"
-          }
-        }, 
-        {
-          "type_key": "Array", 
-          "data": [13, 14, 15, 16]
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "1"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "64"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }, 
-        {
-          "type_key": "IntImm", 
-          "attrs": {
-            "dtype": "int32", 
-            "value": "54"
-          }
-        }
-      ], 
-      "b64ndarrays": [
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRoAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-        "P6G0lvBAXt0AAAAAAAAAAAEAAAAAAAAABAAAAAIgAQABAAAAAAAAAEAAAAAAAAAANgAAAAAAAAA2AAAAAAAAAABkCwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAAAAAAAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AACAfwAAgH8AAAAAAACAfwAAgH8AAIB/AACAfwAAgH8AAAAAAAAAAAAAAAAAAAAAAACAfwAAAAAAAAAAAACAfwNReRkAAAAAAAAAAAAAAAAAAIB/AAAAAAAAAAAAAIB/AAA [...]
-      ], 
-      "attrs": {"tvm_version": "0.8.dev0"}
-    }
+
     done
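
The trace above is what a ``trace`` callback installed on the ``PassContext``
prints before each pass executes. A minimal sketch of such a script, assuming
the ``trace`` keyword of the TVM 0.8.dev Python API and an illustrative
conv2d/add/relu toy network matching the tensor shapes in the dumps above:

    import numpy as np
    import tvm
    from tvm import relay

    def print_ir(mod, info, is_before):
        # Report the pass metadata and dump the module IR before each pass runs.
        if is_before:
            print("Running pass:", info)
            print(mod)

    # Toy network: conv2d (56x56 -> 54x54), add a constant, relu.
    x = relay.var("x", shape=(1, 64, 56, 56))
    weight = relay.var("weight", shape=(64, 64, 3, 3))
    y = relay.nn.conv2d(x, weight, channels=64, kernel_size=(3, 3))
    y = relay.add(y, relay.const(np.random.rand(1, 64, 54, 54).astype("float32")))
    y = relay.nn.relu(y)
    mod = tvm.IRModule.from_expr(relay.Function([x, weight], y))

    # An illustrative pass sequence; the passes named here are the ones
    # reported in the trace above.
    seq = tvm.transform.Sequential(
        [
            relay.transform.FoldConstant(),
            relay.transform.EliminateCommonSubexpr(),
            relay.transform.FuseOps(fuse_opt_level=2),
            relay.transform.AlterOpLayout(),
        ]
    )

    # opt_level=3 enables EliminateCommonSubexpr and AlterOpLayout;
    # AlterOpLayout also needs a concrete target in scope.
    with tvm.transform.PassContext(opt_level=3, trace=print_ir):
        with tvm.target.Target("llvm"):
            mod = seq(mod)
    print("done")

Because the callback fires for every scheduled pass, the implicitly required
InferType runs are reported too, which is why InferType appears repeatedly in
the trace.
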
 
 
diff --git a/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt b/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt
index ae57a42..80135f9 100644
--- a/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_model_on_android.rst.txt
@@ -421,7 +421,7 @@ Execute on TVM
 
     TVM prediction top-1: tiger cat
     Evaluate inference time cost...
-    Mean inference time (std dev): 5.84 ms (0.08 ms)
+    Mean inference time (std dev): 5.41 ms (0.17 ms)
 
 
 
diff --git a/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt b/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt
index ee12e45..25f54fb 100644
--- a/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_object_detection_pytorch.rst.txt
@@ -247,7 +247,7 @@ Get boxes with score larger than 0.9
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  3.013 seconds)
+   **Total running time of the script:** ( 2 minutes  5.919 seconds)
 
 
 .. _sphx_glr_download_tutorials_frontend_deploy_object_detection_pytorch.py:
diff --git a/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt b/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt
index 4a0dc6e..655b37b 100644
--- a/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_prequantized.rst.txt
@@ -350,7 +350,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
  .. code-block:: none
 
-    Elapsed average ms: 20.089770379999997
+    Elapsed average ms: 19.227042330000003
 
 
 
diff --git a/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt b/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt
index 50e8e95..22b583f 100644
--- a/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_prequantized_tflite.rst.txt
@@ -368,7 +368,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
  .. code-block:: none
 
-    Elapsed average ms: 36.11040422000001
+    Elapsed average ms: 36.272248340000004
 
 
 
@@ -401,7 +401,7 @@ Here we give an example of how to measure performance of TVM compiled models.
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 2 minutes  33.806 seconds)
+   **Total running time of the script:** ( 2 minutes  37.496 seconds)
 
 
 .. _sphx_glr_download_tutorials_frontend_deploy_prequantized_tflite.py:
diff --git a/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt b/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt
index 169672a..e1fbb21 100644
--- a/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt
+++ b/docs/_sources/tutorials/frontend/deploy_ssd_gluoncv.rst.txt
@@ -167,130 +167,6 @@ Create TVM runtime and do inference
 
 
 
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 3, 512, 512), 'float32'), ('TENSOR', (64, 3, 7, 7), 'float32'), (2, 2), (3, 3, 3, 3), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 128, 128), 'float32'), ('TENSOR', (64, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 128, 128), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 128, 128), 'float32'), ('TENSOR', (256, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 128, 128), 'float32'), ('TENSOR', (64, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 128, 128), 'float32'), ('TENSOR', (128, 256, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 64, 64), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 64, 64), 'float32'), ('TENSOR', (512, 128, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 128, 128), 'float32'), ('TENSOR', (512, 256, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 64, 64), 'float32'), ('TENSOR', (128, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 64, 64), 'float32'), ('TENSOR', (256, 512, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 32, 32), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 32, 32), 'float32'), ('TENSOR', (1024, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 64, 64), 'float32'), ('TENSOR', (1024, 512, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (256, 1024, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (84, 1024, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (512, 1024, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 16, 16), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 16, 16), 'float32'), ('TENSOR', (2048, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (2048, 1024, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 2048, 16, 16), 'float32'), ('TENSOR', (512, 2048, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 2048, 16, 16), 'float32'), ('TENSOR', (126, 2048, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 16, 16), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (126, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (512, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 4, 4), 'float32'), ('TENSOR', (126, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 4, 4), 'float32'), ('TENSOR', (256, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 4, 4), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (84, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (256, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 1, 1), 'float32'), ('TENSOR', (84, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (16, 1024, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 2048, 16, 16), 'float32'), ('TENSOR', (24, 2048, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (24, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 4, 4), 'float32'), ('TENSOR', (24, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (16, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 1, 1), 'float32'), ('TENSOR', (16, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 32, 32, 4), 'float32'), ('TENSOR', (12, 256, 3, 3, 4, 7), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW4c', 'NCHW7c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 32, 32, 8), 'float32'), ('TENSOR', (128, 32, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 32, 32, 8), 'float32'), ('TENSOR', (32, 32, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 32, 32, 8), 'float32'), ('TENSOR', (32, 128, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 64, 64, 8), 'float32'), ('TENSOR', (32, 64, 1, 1, 8, 8), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 16, 64, 64, 8), 'float32'), ('TENSOR', (64, 16, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 16, 64, 64, 8), 'float32'), ('TENSOR', (16, 16, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 64, 64, 8), 'float32'), ('TENSOR', (16, 64, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 128, 128, 8), 'float32'), ('TENSOR', (16, 32, 1, 1, 8, 8), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 8, 128, 128, 8), 'float32'), ('TENSOR', (32, 8, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 8, 128, 128, 8), 'float32'), ('TENSOR', (8, 8, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 128, 128, 8), 'float32'), ('TENSOR', (8, 32, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 8, 128, 128, 8), 'float32'), ('TENSOR', (8, 8, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1, 512, 512, 3), 'float32'), ('TENSOR', (8, 1, 7, 7, 3, 8), 'float32'), (2, 2), (3, 3, 3, 3), (1, 1), 'NCHW3c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 128, 128, 8), 'float32'), ('TENSOR', (64, 32, 1, 1, 8, 8), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 64, 64, 8), 'float32'), ('TENSOR', (128, 64, 1, 1, 8, 8), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 512, 16, 16, 4), 'float32'), ('TENSOR', (18, 512, 3, 3, 4, 7), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW4c', 'NCHW7c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 16, 16, 8), 'float32'), ('TENSOR', (256, 64, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 16, 16, 8), 'float32'), ('TENSOR', (64, 64, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 16, 16, 8), 'float32'), ('TENSOR', (64, 256, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 32, 32, 8), 'float32'), ('TENSOR', (64, 128, 1, 1, 8, 8), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 32, 32, 8), 'float32'), ('TENSOR', (256, 128, 1, 1, 8, 8), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 8, 8, 4), 'float32'), ('TENSOR', (18, 128, 3, 3, 4, 7), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW4c', 'NCHW7c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 16, 16, 8), 'float32'), ('TENSOR', (64, 64, 3, 3, 8, 8), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 4, 4, 4), 'float32'), ('TENSOR', (18, 128, 3, 3, 4, 7), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW4c', 'NCHW7c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 8, 8, 8), 'float32'), ('TENSOR', (64, 64, 3, 3, 8, 8), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 8, 8, 8), 'float32'), ('TENSOR', (64, 64, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 2, 2, 4), 'float32'), ('TENSOR', (12, 64, 3, 3, 4, 7), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW4c', 'NCHW7c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 4, 4, 8), 'float32'), ('TENSOR', (32, 32, 3, 3, 8, 8), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 4, 4, 8), 'float32'), ('TENSOR', (32, 64, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 1, 1, 4), 'float32'), ('TENSOR', (12, 64, 3, 3, 4, 7), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW4c', 'NCHW7c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 2, 2, 8), 'float32'), ('TENSOR', (32, 32, 3, 3, 8, 8), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 2, 2, 8), 'float32'), ('TENSOR', (32, 32, 1, 1, 8, 8), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 128, 32, 32, 8), 'float32'), ('TENSOR', (2, 128, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 256, 16, 16, 8), 'float32'), ('TENSOR', (3, 256, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 8, 8, 8), 'float32'), ('TENSOR', (3, 64, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 4, 4, 8), 'float32'), ('TENSOR', (3, 64, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 2, 2, 8), 'float32'), ('TENSOR', (2, 32, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 1, 1, 8), 'float32'), ('TENSOR', (2, 32, 3, 3, 8, 8), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW8c', 'NCHW8c', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 3, 512, 512), 'float32'), ('TENSOR', (64, 3, 7, 7), 'float32'), (2, 2), (3, 3, 3, 3), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 128, 128), 'float32'), ('TENSOR', (64, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 128, 128), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 128, 128), 'float32'), ('TENSOR', (256, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 128, 128), 'float32'), ('TENSOR', (64, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 128, 128), 'float32'), ('TENSOR', (128, 256, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 64, 64), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 64, 64), 'float32'), ('TENSOR', (512, 128, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 128, 128), 'float32'), ('TENSOR', (512, 256, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 64, 64), 'float32'), ('TENSOR', (128, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 64, 64), 'float32'), ('TENSOR', (256, 512, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 32, 32), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 32, 32), 'float32'), ('TENSOR', (1024, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 64, 64), 'float32'), ('TENSOR', (1024, 512, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (256, 1024, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (84, 1024, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (512, 1024, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 16, 16), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 16, 16), 'float32'), ('TENSOR', (2048, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (2048, 1024, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 2048, 16, 16), 'float32'), ('TENSOR', (512, 2048, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 2048, 16, 16), 'float32'), ('TENSOR', (126, 2048, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 16, 16), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (126, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (512, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 4, 4), 'float32'), ('TENSOR', (126, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 4, 4), 'float32'), ('TENSOR', (256, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 4, 4), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (84, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (256, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 1, 1), 'float32'), ('TENSOR', (84, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 32, 32), 'float32'), ('TENSOR', (16, 1024, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 2048, 16, 16), 'float32'), ('TENSOR', (24, 2048, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 8, 8), 'float32'), ('TENSOR', (24, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 4, 4), 'float32'), ('TENSOR', (24, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 2, 2), 'float32'), ('TENSOR', (16, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 1, 1), 'float32'), ('TENSOR', (16, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
-
 
 
 Display result
@@ -319,7 +195,7 @@ Display result
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 1 minutes  48.793 seconds)
+   **Total running time of the script:** ( 1 minutes  54.583 seconds)
 
 
 .. _sphx_glr_download_tutorials_frontend_deploy_ssd_gluoncv.py:
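
The long block removed above is the stream of "Cannot find config" warnings AutoTVM emits when no tuned schedule matches a workload and a fallback configuration is used. They disappear once a tuning log covering those workloads is applied at build time; a minimal sketch of that, with a hypothetical ``tuning.log`` and a stand-in module in place of the tutorial's SSD network:

.. code-block:: python

    import os
    import tvm
    from tvm import autotvm, relay

    # Stand-in Relay module; the tutorial builds a GluonCV SSD model instead.
    x = relay.var("x", shape=(1, 3, 512, 512), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

    log_file = "tuning.log"  # hypothetical log from a prior AutoTVM run
    if os.path.exists(log_file):
        # Tuned schedules from the log silence the fallback warnings.
        with autotvm.apply_history_best(log_file):
            lib = relay.build(mod, target="llvm")
    else:
        # Without a log, TVM falls back to default schedules and prints
        # the warnings seen in the removed block.
        lib = relay.build(mod, target="llvm")
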
diff --git a/docs/_sources/tutorials/frontend/from_onnx.rst.txt b/docs/_sources/tutorials/frontend/from_onnx.rst.txt
index 36de415..d30adb4 100644
--- a/docs/_sources/tutorials/frontend/from_onnx.rst.txt
+++ b/docs/_sources/tutorials/frontend/from_onnx.rst.txt
@@ -130,7 +130,7 @@ Compile the model with relay
 
  .. code-block:: none
 
-    /workspace/docs/../python/tvm/relay/frontend/onnx.py:2694: UserWarning: Mismatched attribute type in ' : kernel_shape'
+    /workspace/docs/../python/tvm/relay/frontend/onnx.py:2736: UserWarning: Mismatched attribute type in ' : kernel_shape'
 
     ==> Context: Bad node spec: input: "1" input: "2" output: "11" op_type: "Conv" attribute { name: "kernel_shape" ints: 5 ints: 5 } attribute { name: "strides" ints: 1 ints: 1 } attribute { name: "pads" ints: 2 ints: 2 ints: 2 ints: 2 } attribute { name: "dilations" ints: 1 ints: 1 } attribute { name: "group" i: 1 }
       warnings.warn(str(e))
@@ -150,17 +150,6 @@ Execute on TVM
 
 
 
-.. rst-class:: sphx-glr-script-out
-
- Out:
-
- .. code-block:: none
-
-    ...47%, 0.01 MB, 31 KB/s, 0 seconds passed ...94%, 0.02 MB, 62 KB/s, 0 seconds passed ...100%, 0.02 MB, 93 KB/s, 0 seconds passed
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 32, 224, 224), 'float32'), ('TENSOR', (9, 32, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (32, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-    Cannot find config for target=llvm -keys=cpu, workload=('conv2d_NCHWc.x86', ('TENSOR', (1, 1, 224, 224), 'float32'), ('TENSOR', (64, 1, 5, 5), 'float32'), (1, 1), (2, 2, 2, 2), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
-
 
 
 Display results
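
Both hunks above only track build-to-build drift: the UserWarning moved because ``onnx.py`` grew by a few lines, and a stale output block was dropped. For context, the import step that triggers the warning looks roughly like this; the file path is hypothetical, while the input name ``"1"`` and shape match the super-resolution model the tutorial uses:

.. code-block:: python

    import onnx
    from tvm import relay

    # Hypothetical local copy of the tutorial's super-resolution model.
    onnx_model = onnx.load("super_resolution.onnx")
    shape_dict = {"1": (1, 1, 224, 224)}

    # Convert the ONNX graph into a Relay module plus its parameters.
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
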
diff --git a/docs/_sources/tutorials/frontend/from_pytorch.rst.txt b/docs/_sources/tutorials/frontend/from_pytorch.rst.txt
index a559b95..ad3dcc7 100644
--- a/docs/_sources/tutorials/frontend/from_pytorch.rst.txt
+++ b/docs/_sources/tutorials/frontend/from_pytorch.rst.txt
@@ -149,6 +149,15 @@ Compile the graph to llvm target with given input specification.
 
 
 
+.. rst-class:: sphx-glr-script-out
+
+ Out:
+
+ .. code-block:: none
+
+    ...47%, 0.01 MB, 40 KB/s, 0 seconds passed ...94%, 0.02 MB, 81 KB/s, 0 seconds passed ...100%, 0.02 MB, 121 KB/s, 0 seconds passed
+    Cannot find config for target=llvm -keys=cpu, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
+
 
 
 Execute the portable graph on TVM
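
The block added above records a weight-download progress line and a ``dense_nopack.x86`` fallback warning emitted while compiling the traced model. The import step that precedes that compile is, in sketch form (torchvision ``resnet18`` as in the tutorial; the input name is illustrative):

.. code-block:: python

    import torch
    import torchvision
    from tvm import relay

    # Trace a torchvision model to TorchScript.
    model = torchvision.models.resnet18(pretrained=True).eval()
    input_data = torch.randn(1, 3, 224, 224)
    scripted_model = torch.jit.trace(model, input_data).eval()

    # from_pytorch takes the traced module plus (input_name, shape) pairs.
    shape_list = [("input0", (1, 3, 224, 224))]
    mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
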
diff --git a/docs/_sources/tutorials/frontend/from_tensorflow.rst.txt b/docs/_sources/tutorials/frontend/from_tensorflow.rst.txt
index aabfc36..d88b140 100644
--- a/docs/_sources/tutorials/frontend/from_tensorflow.rst.txt
+++ b/docs/_sources/tutorials/frontend/from_tensorflow.rst.txt
@@ -195,10 +195,1971 @@ Results:
 
  .. code-block:: none
 
-    /workspace/docs/../python/tvm/relay/frontend/tensorflow.py:2948: UserWarning: Ignore the passed shape. Shape in graphdef will be used for operator DecodeJpeg/contents.
+    /workspace/docs/../python/tvm/relay/frontend/tensorflow.py:2949: UserWarning: Ignore the passed shape. Shape in graphdef will be used for operator DecodeJpeg/contents.
       "will be used for operator %s." % node.name
     /workspace/docs/../python/tvm/relay/frontend/tensorflow.py:735: UserWarning: DecodeJpeg: It's a pass through, please handle preprocessing before input
       warnings.warn("DecodeJpeg: It's a pass through, please handle preprocessing before input")
+    WARNING:root:Attribute Tdim is ignored in relay.sym.expand_dims
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.expand_dims
+    WARNING:root:Attribute T is ignored in relay.sym.expand_dims
+    WARNING:root:Attribute _node_name is ignored in relay.sym.expand_dims
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.expand_dims
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.resize
+    WARNING:root:Attribute half_pixel_centers is ignored in relay.sym.resize
+    WARNING:root:Attribute T is ignored in relay.sym.resize
+    WARNING:root:Attribute _node_name is ignored in relay.sym.resize
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.resize
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute ksize is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute T is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute ksize is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.max_pool2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute message is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.copy
+    WARNING:root:Attribute T is ignored in relay.sym.copy
+    WARNING:root:Attribute _node_name is ignored in relay.sym.copy
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.copy
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.relu
+    WARNING:root:Attribute T is ignored in relay.sym.relu
+    WARNING:root:Attribute _node_name is ignored in relay.sym.relu
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.relu
+    WARNING:root:Attribute explicit_paddings is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.conv2d
+    WARNING:root:Attribute use_cudnn_on_gpu is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _node_name is ignored in relay.sym.conv2d
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.conv2d
+    WARNING:root:Attribute T is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _output_shapes is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _node_name is ignored in relay.sym.batch_norm
+    WARNING:root:Attribute _target_layout is ignored in relay.sym.batch_norm
+    [... repeated importer warnings collapsed: attributes T, N, ksize, message,
+    explicit_paddings, use_cudnn_on_gpu, _output_shapes, _node_name, and
+    _target_layout are likewise reported as ignored for relay.sym.conv2d,
+    batch_norm, relu, copy, avg_pool2d, max_pool2d, and concatenate
+    throughout the imported graph ...]
... 110326 lines suppressed ...
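For context, warnings of this form are emitted by TVM's TensorFlow frontend when a
GraphDef carries node attributes that have no Relay equivalent; they are informational
and do not change the converted module. A minimal sketch of the import step that
produces them, assuming a frozen GraphDef on disk (the "model.pb" path and the input
name/shape below are hypothetical, not taken from this build log):

    import logging
    import tensorflow as tf
    from tvm import relay

    # Surface the WARNING-level messages shown in the log above.
    logging.basicConfig(level=logging.WARNING)

    # Load a frozen TensorFlow graph; "model.pb" is a placeholder path.
    with tf.io.gfile.GFile("model.pb", "rb") as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    # Importing into Relay logs "Attribute ... is ignored in relay.sym...."
    # for TF-only attributes such as T, _output_shapes, and use_cudnn_on_gpu.
    mod, params = relay.frontend.from_tensorflow(
        graph_def, layout="NCHW", shape={"input": (1, 299, 299, 3)}
    )

The resulting mod/params can then be compiled with relay.build as in the tutorials
touched by this commit; the warnings can be safely ignored or silenced by raising the
root logger level above WARNING.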